Notebook 2 development
Missing values, outliers and correlations
This notebook studies and preprocesses the categorical, continuous and boolean variables, following the structure below:
Variable type assignment
- Data type conversion
Stratified train/test split
Descriptive visualization of the data
Distribution plots of the variables
Treatment of continuous variables
- Correlation plot
- Treatment of null values
- Imputing null values
Treatment of categorical and boolean variables
- Treatment of null values
- Imputing null values
Import libraries
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
import sklearn
from sklearn.impute import KNNImputer
import scipy.stats as ss
import warnings
from sklearn.model_selection import train_test_split
import sys
sys.path.append('/Users/miguelflores/Desktop/P1/practica1')
from funciones import funciones_auxiliares as f_aux
semilla = 42
pd.set_option("display.max_rows", 10000)
pd.set_option("display.max_columns", 10000)
pd.set_option("display.width", 10000)
Reading the data from the initial preprocessing
In [2]:
df = pd.read_csv('... /data/pd_data_initial_preprocessing.csv').set_index('SK_ID_CURR')
df
Out[2]:
| TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | 
FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | NWEEKDAY_PROCESS_START | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.0220 | 0.0198 | 0.0 | 0.0000 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.0000 | reg oper account | block of flats | 0.0149 | Stone, brick | 0 | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 |
| 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.0790 | 0.0554 | 0.0 | 0.0000 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.0100 | reg oper account | block of flats | 0.0714 | Block | 0 | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 3 |
| 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 456251 | 0 | Cash loans | M | N | N | 0 | 157500.0 | 254700.0 | 27558.0 | 225000.0 | Unaccompanied | Working | Secondary / secondary special | Separated | With parents | 0.032561 | -9327 | -236 | -8456.0 | -1982 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Sales staff | 1.0 | 1 | 1 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | Services | 0.145570 | 0.681632 | NaN | 0.2021 | 0.0887 | 0.9876 | 0.8300 | 0.0202 | 0.22 | 0.1034 | 0.6042 | 0.2708 | 0.0594 | 0.1484 | 0.1965 | 0.0753 | 0.1095 | 0.1008 | 0.0172 | 0.9782 | 0.7125 | 0.0172 | 0.0806 | 0.0345 | 0.4583 | 0.0417 | 0.0094 | 0.0882 | 0.0853 | 0.0 | 0.0125 | 0.2040 | 0.0887 | 0.9876 | 0.8323 | 0.0203 | 0.22 | 0.1034 | 0.6042 | 0.2708 | 0.0605 | 0.1509 | 0.2001 | 0.0757 | 0.1118 | reg oper account | block of flats | 0.2898 | Stone, brick | 0 | 0.0 | 0.0 | 0.0 | 0.0 | -273.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 4 |
| 456252 | 0 | Cash loans | F | N | Y | 0 | 72000.0 | 269550.0 | 12001.5 | 225000.0 | Unaccompanied | Pensioner | Secondary / secondary special | Widow | House / apartment | 0.025164 | -20775 | 365243 | -4388.0 | -4090 | NaN | 1 | 0 | 0 | 1 | 1 | 0 | NaN | 1.0 | 2 | 2 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | NaN | 0.115992 | NaN | 0.0247 | 0.0435 | 0.9727 | 0.6260 | 0.0022 | 0.00 | 0.1034 | 0.0833 | 0.1250 | 0.0579 | 0.0202 | 0.0257 | 0.0000 | 0.0000 | 0.0252 | 0.0451 | 0.9727 | 0.6406 | 0.0022 | 0.0000 | 0.1034 | 0.0833 | 0.1250 | 0.0592 | 0.0220 | 0.0267 | 0.0 | 0.0000 | 0.0250 | 0.0435 | 0.9727 | 0.6310 | 0.0022 | 0.00 | 0.1034 | 0.0833 | 0.1250 | 0.0589 | 0.0205 | 0.0261 | 0.0000 | 0.0000 | reg oper account | block of flats | 0.0214 | Stone, brick | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
| 456253 | 0 | Cash loans | F | N | Y | 0 | 153000.0 | 677664.0 | 29979.0 | 585000.0 | Unaccompanied | Working | Higher education | Separated | House / apartment | 0.005002 | -14966 | -7921 | -6737.0 | -5150 | NaN | 1 | 1 | 0 | 1 | 0 | 1 | Managers | 1.0 | 3 | 3 | 9 | 0 | 0 | 0 | 0 | 1 | 1 | School | 0.744026 | 0.535722 | 0.218859 | 0.1031 | 0.0862 | 0.9816 | 0.7484 | 0.0123 | 0.00 | 0.2069 | 0.1667 | 0.2083 | NaN | 0.0841 | 0.9279 | 0.0000 | 0.0000 | 0.1050 | 0.0894 | 0.9816 | 0.7583 | 0.0124 | 0.0000 | 0.2069 | 0.1667 | 0.2083 | NaN | 0.0918 | 0.9667 | 0.0 | 0.0000 | 0.1041 | 0.0862 | 0.9816 | 0.7518 | 0.0124 | 0.00 | 0.2069 | 0.1667 | 0.2083 | NaN | 0.0855 | 0.9445 | 0.0000 | 0.0000 | reg oper account | block of flats | 0.7970 | Panel | 0 | 6.0 | 0.0 | 6.0 | 0.0 | -1909.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 4 |
| 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | 319500.0 | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | 0.005313 | -11961 | -4786 | -2562.0 | -931 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | 9 | 0 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 1 | NaN | 0.514163 | 0.661024 | 0.0124 | NaN | 0.9771 | NaN | NaN | NaN | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0061 | NaN | NaN | 0.0126 | NaN | 0.9772 | NaN | NaN | NaN | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0063 | NaN | NaN | 0.0125 | NaN | 0.9771 | NaN | NaN | NaN | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0062 | NaN | NaN | NaN | block of flats | 0.0086 | Stone, brick | 0 | 0.0 | 0.0 | 0.0 | 0.0 | -322.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
| 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | 675000.0 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.046220 | -16856 | -1262 | -5128.0 | -410 | NaN | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 2.0 | 1 | 1 | 20 | 0 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.734460 | 0.708569 | 0.113922 | 0.0742 | 0.0526 | 0.9881 | NaN | 0.0176 | 0.08 | 0.0690 | 0.3750 | NaN | NaN | NaN | 0.0791 | NaN | 0.0000 | 0.0756 | 0.0546 | 0.9881 | NaN | 0.0178 | 0.0806 | 0.0690 | 0.3750 | NaN | NaN | NaN | 0.0824 | NaN | 0.0000 | 0.0749 | 0.0526 | 0.9881 | NaN | 0.0177 | 0.08 | 0.0690 | 0.3750 | NaN | NaN | NaN | 0.0805 | NaN | 0.0000 | NaN | block of flats | 0.0718 | Panel | 0 | 0.0 | 0.0 | 0.0 | 0.0 | -787.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 4 |
307511 rows × 121 columns
Variable type assignment (categorical, continuous and boolean)
As seen earlier in notebook 1, each variable is now categorized by type and placed into a list, so that the corresponding data type can be assigned afterwards.
In [3]:
f_aux.clasificar_variables(df)
Variables Booleanas: 36 ['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
============================================================================================================================================================================
Variables Categóricas: 14 ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE']
============================================================================================================================================================================
Variables Continuas: 65 ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
============================================================================================================================================================================
Variables no clasificadas: 6 ['CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START', 'NWEEKDAY_PROCESS_START']
Out[3]:
(['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'], ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE'], ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 
'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'], ['CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START', 'NWEEKDAY_PROCESS_START'])
In [4]:
f_aux.nueva_clasificar_variables(df)
Variables Booleanas: 36 ['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
============================================================================================================================================================================
Variables Categóricas: 16 ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'CNT_CHILDREN', 'NWEEKDAY_PROCESS_START']
============================================================================================================================================================================
Variables Continuas: 69 ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START']
=============================================================================================================================================================================
Variables no clasificadas: 0 []
Out[4]:
(['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'], ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'CNT_CHILDREN', 'NWEEKDAY_PROCESS_START'], ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 
'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START'], [])
In [5]:
lista_var_bool, lista_var_cat, lista_var_con, lista_var_no_clasificadas = f_aux.nueva_clasificar_variables(df)
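The source of `f_aux.clasificar_variables` and `f_aux.nueva_clasificar_variables` is not shown in this notebook, so the following is only a minimal sketch of how a classification of this kind could work. The function name `classify_columns_sketch`, the `max_categories` threshold and the toy data are all assumptions made for illustration; the real helpers may use different rules.

```python
import pandas as pd


def classify_columns_sketch(frame: pd.DataFrame, max_categories: int = 10):
    """Hypothetical heuristic; NOT the implementation in funciones_auxiliares."""
    boolean_cols, categorical_cols, continuous_cols = [], [], []
    for name in frame.columns:
        column = frame[name]
        n_unique = column.nunique(dropna=True)
        if n_unique <= 2:
            # Two distinct values or fewer: treat the column as a flag
            boolean_cols.append(name)
        elif not pd.api.types.is_numeric_dtype(column) or n_unique <= max_categories:
            # Text columns, or numeric codes with only a few levels
            categorical_cols.append(name)
        else:
            # Numeric column with many distinct values
            continuous_cols.append(name)
    return boolean_cols, categorical_cols, continuous_cols


# Toy data (values chosen for illustration only)
toy = pd.DataFrame({
    "TARGET": [0, 1, 0, 0, 1, 0],
    "CODE_GENDER": ["M", "F", "F", "XNA", "M", "F"],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0, 312682.5, 513000.0, 254700.0],
})
bool_cols, cat_cols, cont_cols = classify_columns_sketch(toy, max_categories=3)
print(bool_cols, cat_cols, cont_cols)
```

A purely cardinality-based rule like this one can misplace columns such as `CNT_CHILDREN` or `HOUR_APPR_PROCESS_START`, which is presumably why the second helper assigns the previously unclassified variables explicitly.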
Data type conversion
In [6]:
# Categorical variables
df[lista_var_cat] = df[lista_var_cat].astype("category")
# Continuous variables: coerce unparseable entries to NaN first, then cast to float
df[lista_var_con] = df[lista_var_con].apply(pd.to_numeric, errors='coerce').astype(float)
# Target as integer
df['TARGET'] = df['TARGET'].astype(int)
df.dtypes
Out[6]:
TARGET                          int64
NAME_CONTRACT_TYPE              category
CODE_GENDER                     category
FLAG_OWN_CAR                    object
FLAG_OWN_REALTY                 object
CNT_CHILDREN                    category
AMT_INCOME_TOTAL                float64
AMT_CREDIT                      float64
AMT_ANNUITY                     float64
AMT_GOODS_PRICE                 float64
NAME_TYPE_SUITE                 category
NAME_INCOME_TYPE                category
NAME_EDUCATION_TYPE             category
NAME_FAMILY_STATUS              category
NAME_HOUSING_TYPE               category
REGION_POPULATION_RELATIVE      float64
DAYS_BIRTH                      float64
DAYS_EMPLOYED                   float64
DAYS_REGISTRATION               float64
DAYS_ID_PUBLISH                 float64
OWN_CAR_AGE                     float64
FLAG_MOBIL                      int64
FLAG_EMP_PHONE                  int64
FLAG_WORK_PHONE                 int64
FLAG_CONT_MOBILE                int64
FLAG_PHONE                      int64
FLAG_EMAIL                      int64
OCCUPATION_TYPE                 category
CNT_FAM_MEMBERS                 float64
REGION_RATING_CLIENT            category
REGION_RATING_CLIENT_W_CITY     category
HOUR_APPR_PROCESS_START         float64
REG_REGION_NOT_LIVE_REGION      int64
REG_REGION_NOT_WORK_REGION      int64
LIVE_REGION_NOT_WORK_REGION     int64
REG_CITY_NOT_LIVE_CITY          int64
REG_CITY_NOT_WORK_CITY          int64
LIVE_CITY_NOT_WORK_CITY         int64
ORGANIZATION_TYPE               category
EXT_SOURCE_1                    float64
EXT_SOURCE_2                    float64
EXT_SOURCE_3                    float64
APARTMENTS_AVG                  float64
BASEMENTAREA_AVG                float64
YEARS_BEGINEXPLUATATION_AVG     float64
YEARS_BUILD_AVG                 float64
COMMONAREA_AVG                  float64
ELEVATORS_AVG                   float64
ENTRANCES_AVG                   float64
FLOORSMAX_AVG                   float64
FLOORSMIN_AVG                   float64
LANDAREA_AVG                    float64
LIVINGAPARTMENTS_AVG            float64
LIVINGAREA_AVG                  float64
NONLIVINGAPARTMENTS_AVG         float64
NONLIVINGAREA_AVG               float64
APARTMENTS_MODE                 float64
BASEMENTAREA_MODE               float64
YEARS_BEGINEXPLUATATION_MODE    float64
YEARS_BUILD_MODE                float64
COMMONAREA_MODE                 float64
ELEVATORS_MODE                  float64
ENTRANCES_MODE                  float64
FLOORSMAX_MODE                  float64
FLOORSMIN_MODE                  float64
LANDAREA_MODE                   float64
LIVINGAPARTMENTS_MODE           float64
LIVINGAREA_MODE                 float64
NONLIVINGAPARTMENTS_MODE        float64
NONLIVINGAREA_MODE              float64
APARTMENTS_MEDI                 float64
BASEMENTAREA_MEDI               float64
YEARS_BEGINEXPLUATATION_MEDI    float64
YEARS_BUILD_MEDI                float64
COMMONAREA_MEDI                 float64
ELEVATORS_MEDI                  float64
ENTRANCES_MEDI                  float64
FLOORSMAX_MEDI                  float64
FLOORSMIN_MEDI                  float64
LANDAREA_MEDI                   float64
LIVINGAPARTMENTS_MEDI           float64
LIVINGAREA_MEDI                 float64
NONLIVINGAPARTMENTS_MEDI        float64
NONLIVINGAREA_MEDI              float64
FONDKAPREMONT_MODE              category
HOUSETYPE_MODE                  category
TOTALAREA_MODE                  float64
WALLSMATERIAL_MODE              category
EMERGENCYSTATE_MODE             int64
OBS_30_CNT_SOCIAL_CIRCLE        float64
DEF_30_CNT_SOCIAL_CIRCLE        float64
OBS_60_CNT_SOCIAL_CIRCLE        float64
DEF_60_CNT_SOCIAL_CIRCLE        float64
DAYS_LAST_PHONE_CHANGE          float64
FLAG_DOCUMENT_2                 int64
FLAG_DOCUMENT_3                 int64
FLAG_DOCUMENT_4                 int64
FLAG_DOCUMENT_5                 int64
FLAG_DOCUMENT_6                 int64
FLAG_DOCUMENT_7                 int64
FLAG_DOCUMENT_8                 int64
FLAG_DOCUMENT_9                 int64
FLAG_DOCUMENT_10                int64
FLAG_DOCUMENT_11                int64
FLAG_DOCUMENT_12                int64
FLAG_DOCUMENT_13                int64
FLAG_DOCUMENT_14                int64
FLAG_DOCUMENT_15                int64
FLAG_DOCUMENT_16                int64
FLAG_DOCUMENT_17                int64
FLAG_DOCUMENT_18                int64
FLAG_DOCUMENT_19                int64
FLAG_DOCUMENT_20                int64
FLAG_DOCUMENT_21                int64
AMT_REQ_CREDIT_BUREAU_HOUR      float64
AMT_REQ_CREDIT_BUREAU_DAY       float64
AMT_REQ_CREDIT_BUREAU_WEEK      float64
AMT_REQ_CREDIT_BUREAU_MON       float64
AMT_REQ_CREDIT_BUREAU_QRT       float64
AMT_REQ_CREDIT_BUREAU_YEAR      float64
NWEEKDAY_PROCESS_START          category
dtype: object
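To make the effect of the conversion easier to see, here is a small sketch on a made-up frame. The column names are borrowed from the dataset, but the values (including the deliberately invalid `"not available"` entry) are invented for illustration.

```python
import pandas as pd

# Toy frame; the bad entry is made up to show what errors="coerce" does
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "AMT_CREDIT": ["406597.5", "not available", "135000.0"],
})

# errors="coerce" turns anything unparseable into NaN instead of raising
toy["AMT_CREDIT"] = pd.to_numeric(toy["AMT_CREDIT"], errors="coerce")
# Categorical dtype stores each distinct label once
toy["NAME_CONTRACT_TYPE"] = toy["NAME_CONTRACT_TYPE"].astype("category")

print(toy.dtypes)
print(toy["AMT_CREDIT"].isna().sum())
```

Because coercion silently creates missing values, it can be worth comparing the null counts before and after the conversion.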
Stratified train/test split¶
The purpose of this step is to keep the class proportions of the target balanced between the training set and the test set. This makes both sets more representative of the data and allows a more accurate evaluation of the model.¶
In [7]:
X = df.drop('TARGET', axis=1) # Drop the 'TARGET' column from the feature set
y = df['TARGET'] # Keep the 'TARGET' column as the target variable
In [8]:
X_pd_loan, X_pd_loan_test, y_pd_loan, y_pd_loan_test = train_test_split(X, y,
stratify=df['TARGET'],
test_size=0.2, random_state = semilla)
df_train = pd.concat([X_pd_loan, y_pd_loan],axis=1)
df_test = pd.concat([X_pd_loan_test, y_pd_loan_test],axis=1)
print('== Train\n', df_train['TARGET'].value_counts(normalize=True))
print('== Test\n', df_test['TARGET'].value_counts(normalize=True))
== Train
 TARGET
0    0.919271
1    0.080729
Name: proportion, dtype: float64
== Test
 TARGET
0    0.919272
1    0.080728
Name: proportion, dtype: float64
This section uses the seed defined at the start of the notebook to make the train/test split reproducible and consistent, so that the results can be replicated in later runs of the same code.¶
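As a minimal sketch of why stratification and a fixed seed matter, the snippet below repeats the split on synthetic data. The toy frame, the `seed` name and the `split` helper are assumptions made for illustration only; they are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

seed = 42  # hypothetical stand-in for the notebook's `semilla`
rng = np.random.default_rng(seed)

# Toy data (assumption): 1,000 rows with roughly 8% positives
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "TARGET": (rng.random(1000) < 0.08).astype(int),
})
X_toy = toy.drop(columns="TARGET")
y_toy = toy["TARGET"]


def split(stratify):
    """Split the toy data 80/20 with a fixed seed."""
    return train_test_split(
        X_toy, y_toy, test_size=0.2, stratify=stratify, random_state=seed
    )


_, _, y_train_s, y_test_s = split(stratify=y_toy)  # stratified on the target
_, _, y_train_r, y_test_r = split(stratify=None)   # plain random split

# Gap between the positive rate in train and in test for each strategy
gap_stratified = abs(y_train_s.mean() - y_test_s.mean())
gap_random = abs(y_train_r.mean() - y_test_r.mean())
print(f"stratified gap: {gap_stratified:.4f}, random gap: {gap_random:.4f}")

# Same seed and same inputs give the same test rows again
same_rows = y_test_s.index.equals(split(stratify=y_toy)[3].index)
print("reproducible:", same_rows)
```

With stratification the positive rate differs between the two sets only by rounding, while a plain random split can drift further on an imbalanced target.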
Descriptive overview of the data¶
The functions nulos_columna( ) and nulos_filas( ) let us check the consistency of the data by counting the null values per variable and per row. This helps assess which variables could contribute most to the model and which may have limited impact because of their high percentage of null values.¶
In [9]:
f_aux.nulos_columna(df)
Out[9]:
| | nulos_columnas | porcentaje_columnas |
|---|---|---|
| COMMONAREA_MODE | 214865 | 69.872297 |
| COMMONAREA_MEDI | 214865 | 69.872297 |
| COMMONAREA_AVG | 214865 | 69.872297 |
| NONLIVINGAPARTMENTS_MEDI | 213514 | 69.432963 |
| NONLIVINGAPARTMENTS_MODE | 213514 | 69.432963 |
| NONLIVINGAPARTMENTS_AVG | 213514 | 69.432963 |
| FONDKAPREMONT_MODE | 210295 | 68.386172 |
| LIVINGAPARTMENTS_AVG | 210199 | 68.354953 |
| LIVINGAPARTMENTS_MEDI | 210199 | 68.354953 |
| LIVINGAPARTMENTS_MODE | 210199 | 68.354953 |
| FLOORSMIN_MODE | 208642 | 67.848630 |
| FLOORSMIN_MEDI | 208642 | 67.848630 |
| FLOORSMIN_AVG | 208642 | 67.848630 |
| YEARS_BUILD_AVG | 204488 | 66.497784 |
| YEARS_BUILD_MEDI | 204488 | 66.497784 |
| YEARS_BUILD_MODE | 204488 | 66.497784 |
| OWN_CAR_AGE | 202929 | 65.990810 |
| LANDAREA_MEDI | 182590 | 59.376738 |
| LANDAREA_AVG | 182590 | 59.376738 |
| LANDAREA_MODE | 182590 | 59.376738 |
| BASEMENTAREA_MEDI | 179943 | 58.515956 |
| BASEMENTAREA_MODE | 179943 | 58.515956 |
| BASEMENTAREA_AVG | 179943 | 58.515956 |
| EXT_SOURCE_1 | 173378 | 56.381073 |
| NONLIVINGAREA_AVG | 169682 | 55.179164 |
| NONLIVINGAREA_MEDI | 169682 | 55.179164 |
| NONLIVINGAREA_MODE | 169682 | 55.179164 |
| ELEVATORS_AVG | 163891 | 53.295980 |
| ELEVATORS_MEDI | 163891 | 53.295980 |
| ELEVATORS_MODE | 163891 | 53.295980 |
| WALLSMATERIAL_MODE | 156341 | 50.840783 |
| APARTMENTS_AVG | 156061 | 50.749729 |
| APARTMENTS_MODE | 156061 | 50.749729 |
| APARTMENTS_MEDI | 156061 | 50.749729 |
| ENTRANCES_MODE | 154828 | 50.348768 |
| ENTRANCES_AVG | 154828 | 50.348768 |
| ENTRANCES_MEDI | 154828 | 50.348768 |
| LIVINGAREA_MODE | 154350 | 50.193326 |
| LIVINGAREA_AVG | 154350 | 50.193326 |
| LIVINGAREA_MEDI | 154350 | 50.193326 |
| HOUSETYPE_MODE | 154297 | 50.176091 |
| FLOORSMAX_MEDI | 153020 | 49.760822 |
| FLOORSMAX_AVG | 153020 | 49.760822 |
| FLOORSMAX_MODE | 153020 | 49.760822 |
| YEARS_BEGINEXPLUATATION_MEDI | 150007 | 48.781019 |
| YEARS_BEGINEXPLUATATION_MODE | 150007 | 48.781019 |
| YEARS_BEGINEXPLUATATION_AVG | 150007 | 48.781019 |
| TOTALAREA_MODE | 148431 | 48.268517 |
| OCCUPATION_TYPE | 96391 | 31.345545 |
| EXT_SOURCE_3 | 60965 | 19.825307 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 41519 | 13.501631 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 41519 | 13.501631 |
| AMT_REQ_CREDIT_BUREAU_QRT | 41519 | 13.501631 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 41519 | 13.501631 |
| AMT_REQ_CREDIT_BUREAU_MON | 41519 | 13.501631 |
| AMT_REQ_CREDIT_BUREAU_DAY | 41519 | 13.501631 |
| NAME_TYPE_SUITE | 1292 | 0.420148 |
| DEF_30_CNT_SOCIAL_CIRCLE | 1021 | 0.332021 |
| DEF_60_CNT_SOCIAL_CIRCLE | 1021 | 0.332021 |
| OBS_30_CNT_SOCIAL_CIRCLE | 1021 | 0.332021 |
| OBS_60_CNT_SOCIAL_CIRCLE | 1021 | 0.332021 |
| EXT_SOURCE_2 | 660 | 0.214626 |
| AMT_GOODS_PRICE | 278 | 0.090403 |
| AMT_ANNUITY | 12 | 0.003902 |
| CNT_FAM_MEMBERS | 2 | 0.000650 |
| DAYS_LAST_PHONE_CHANGE | 1 | 0.000325 |
| FLAG_DOCUMENT_5 | 0 | 0.000000 |
| FLAG_DOCUMENT_6 | 0 | 0.000000 |
| FLAG_DOCUMENT_7 | 0 | 0.000000 |
| FLAG_DOCUMENT_8 | 0 | 0.000000 |
| FLAG_DOCUMENT_4 | 0 | 0.000000 |
| FLAG_DOCUMENT_12 | 0 | 0.000000 |
| FLAG_DOCUMENT_3 | 0 | 0.000000 |
| FLAG_DOCUMENT_2 | 0 | 0.000000 |
| FLAG_DOCUMENT_11 | 0 | 0.000000 |
| FLAG_DOCUMENT_21 | 0 | 0.000000 |
| FLAG_DOCUMENT_20 | 0 | 0.000000 |
| FLAG_DOCUMENT_19 | 0 | 0.000000 |
| EMERGENCYSTATE_MODE | 0 | 0.000000 |
| FLAG_DOCUMENT_18 | 0 | 0.000000 |
| FLAG_DOCUMENT_17 | 0 | 0.000000 |
| FLAG_DOCUMENT_9 | 0 | 0.000000 |
| FLAG_DOCUMENT_16 | 0 | 0.000000 |
| FLAG_DOCUMENT_15 | 0 | 0.000000 |
| FLAG_DOCUMENT_14 | 0 | 0.000000 |
| FLAG_DOCUMENT_13 | 0 | 0.000000 |
| FLAG_DOCUMENT_10 | 0 | 0.000000 |
| TARGET | 0 | 0.000000 |
| NAME_CONTRACT_TYPE | 0 | 0.000000 |
| DAYS_ID_PUBLISH | 0 | 0.000000 |
| CODE_GENDER | 0 | 0.000000 |
| FLAG_OWN_CAR | 0 | 0.000000 |
| FLAG_OWN_REALTY | 0 | 0.000000 |
| CNT_CHILDREN | 0 | 0.000000 |
| AMT_INCOME_TOTAL | 0 | 0.000000 |
| AMT_CREDIT | 0 | 0.000000 |
| NAME_INCOME_TYPE | 0 | 0.000000 |
| NAME_EDUCATION_TYPE | 0 | 0.000000 |
| NAME_FAMILY_STATUS | 0 | 0.000000 |
| NAME_HOUSING_TYPE | 0 | 0.000000 |
| REGION_POPULATION_RELATIVE | 0 | 0.000000 |
| DAYS_BIRTH | 0 | 0.000000 |
| DAYS_EMPLOYED | 0 | 0.000000 |
| DAYS_REGISTRATION | 0 | 0.000000 |
| FLAG_MOBIL | 0 | 0.000000 |
| ORGANIZATION_TYPE | 0 | 0.000000 |
| FLAG_EMP_PHONE | 0 | 0.000000 |
| FLAG_WORK_PHONE | 0 | 0.000000 |
| FLAG_CONT_MOBILE | 0 | 0.000000 |
| FLAG_PHONE | 0 | 0.000000 |
| FLAG_EMAIL | 0 | 0.000000 |
| REGION_RATING_CLIENT | 0 | 0.000000 |
| REGION_RATING_CLIENT_W_CITY | 0 | 0.000000 |
| HOUR_APPR_PROCESS_START | 0 | 0.000000 |
| REG_REGION_NOT_LIVE_REGION | 0 | 0.000000 |
| REG_REGION_NOT_WORK_REGION | 0 | 0.000000 |
| LIVE_REGION_NOT_WORK_REGION | 0 | 0.000000 |
| REG_CITY_NOT_LIVE_CITY | 0 | 0.000000 |
| REG_CITY_NOT_WORK_CITY | 0 | 0.000000 |
| LIVE_CITY_NOT_WORK_CITY | 0 | 0.000000 |
| NWEEKDAY_PROCESS_START | 0 | 0.000000 |
In [10]:
f_aux.nulos_filas(df)
Out[10]:
| SK_ID_CURR | nulos_filas | porcentaje_filas |
|---|---|---|
| 235599 | 60 | 0.495868 |
| 412671 | 60 | 0.495868 |
| 315294 | 60 | 0.495868 |
| 255145 | 60 | 0.495868 |
| 412312 | 60 | 0.495868 |
| ... | ... | ... |
| 250717 | 0 | 0.000000 |
| 250702 | 0 | 0.000000 |
| 250697 | 0 | 0.000000 |
| 250680 | 0 | 0.000000 |
| 278202 | 0 | 0.000000 |
307511 rows × 2 columns
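The implementations of nulos_columna( ) and nulos_filas( ) are not shown in this notebook. The sketch below is one plausible way to produce similar summaries with pandas; the function name `null_summary`, its column names and the toy frame are hypothetical. It reports the share as a percentage for both axes, which is an assumption.

```python
import numpy as np
import pandas as pd


def null_summary(df, axis=0):
    """Count missing values per column (axis=0) or per row (axis=1).

    Hypothetical helper, not the notebook's f_aux implementation.
    """
    counts = df.isnull().sum(axis=axis)
    # Denominator: number of rows for a per-column summary,
    # number of columns for a per-row summary.
    percent = 100 * counts / df.shape[axis]
    out = pd.DataFrame({"nulls": counts, "percent": percent})
    return out.sort_values("nulls", ascending=False)


toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1, 2, 3, 4],
})

by_column = null_summary(toy, axis=0)
by_row = null_summary(toy, axis=1)
print(by_column)
print(by_row)
```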
Variable distribution plots¶
The following cell loops over the columns and calls plot_feature( ) according to each variable's type. For a continuous variable, it produces a histogram and a boxplot against the target variable. For a categorical or boolean variable, it shows two bar charts: one for the overall distribution of the variable and one against the target variable.¶
In [11]:
warnings.filterwarnings('ignore')
for i in list(df_train.columns):
    if (df_train[i].dtype==float) & (i!='TARGET'):
        print('Graficos de la variable: ' + i)
        f_aux.plot_feature(df_train, col_name=i, isContinuous=True, target='TARGET')
    elif i!='TARGET':
        print('Graficos de la variable: ' + i)
        f_aux.plot_feature(df_train, col_name=i, isContinuous=False, target='TARGET')
Graficos de la variable: NAME_CONTRACT_TYPE
Graficos de la variable: CODE_GENDER
Graficos de la variable: FLAG_OWN_CAR
Graficos de la variable: FLAG_OWN_REALTY
Graficos de la variable: CNT_CHILDREN
Graficos de la variable: AMT_INCOME_TOTAL
Graficos de la variable: AMT_CREDIT
Graficos de la variable: AMT_ANNUITY
Graficos de la variable: AMT_GOODS_PRICE
Graficos de la variable: NAME_TYPE_SUITE
Graficos de la variable: NAME_INCOME_TYPE
Graficos de la variable: NAME_EDUCATION_TYPE
Graficos de la variable: NAME_FAMILY_STATUS
Graficos de la variable: NAME_HOUSING_TYPE
Graficos de la variable: REGION_POPULATION_RELATIVE
Graficos de la variable: DAYS_BIRTH
Graficos de la variable: DAYS_EMPLOYED
Graficos de la variable: DAYS_REGISTRATION
Graficos de la variable: DAYS_ID_PUBLISH
Graficos de la variable: OWN_CAR_AGE
Graficos de la variable: FLAG_MOBIL
Graficos de la variable: FLAG_EMP_PHONE
Graficos de la variable: FLAG_WORK_PHONE
Graficos de la variable: FLAG_CONT_MOBILE
Graficos de la variable: FLAG_PHONE
Graficos de la variable: FLAG_EMAIL
Graficos de la variable: OCCUPATION_TYPE
Graficos de la variable: CNT_FAM_MEMBERS
Graficos de la variable: REGION_RATING_CLIENT
Graficos de la variable: REGION_RATING_CLIENT_W_CITY
Graficos de la variable: HOUR_APPR_PROCESS_START
Graficos de la variable: REG_REGION_NOT_LIVE_REGION
Graficos de la variable: REG_REGION_NOT_WORK_REGION
Graficos de la variable: LIVE_REGION_NOT_WORK_REGION
Graficos de la variable: REG_CITY_NOT_LIVE_CITY
Graficos de la variable: REG_CITY_NOT_WORK_CITY
Graficos de la variable: LIVE_CITY_NOT_WORK_CITY
Graficos de la variable: ORGANIZATION_TYPE
Graficos de la variable: EXT_SOURCE_1
Graficos de la variable: EXT_SOURCE_2
Graficos de la variable: EXT_SOURCE_3
Graficos de la variable: APARTMENTS_AVG
Graficos de la variable: BASEMENTAREA_AVG
Graficos de la variable: YEARS_BEGINEXPLUATATION_AVG
Graficos de la variable: YEARS_BUILD_AVG
Graficos de la variable: COMMONAREA_AVG
Graficos de la variable: ELEVATORS_AVG
Graficos de la variable: ENTRANCES_AVG
Graficos de la variable: FLOORSMAX_AVG
Graficos de la variable: FLOORSMIN_AVG
Graficos de la variable: LANDAREA_AVG
Graficos de la variable: LIVINGAPARTMENTS_AVG
Graficos de la variable: LIVINGAREA_AVG
Graficos de la variable: NONLIVINGAPARTMENTS_AVG
Graficos de la variable: NONLIVINGAREA_AVG
Graficos de la variable: APARTMENTS_MODE
Graficos de la variable: BASEMENTAREA_MODE
Graficos de la variable: YEARS_BEGINEXPLUATATION_MODE
Graficos de la variable: YEARS_BUILD_MODE
Graficos de la variable: COMMONAREA_MODE
Graficos de la variable: ELEVATORS_MODE
Graficos de la variable: ENTRANCES_MODE
Graficos de la variable: FLOORSMAX_MODE
Graficos de la variable: FLOORSMIN_MODE
Graficos de la variable: LANDAREA_MODE
Graficos de la variable: LIVINGAPARTMENTS_MODE
Graficos de la variable: LIVINGAREA_MODE
Graficos de la variable: NONLIVINGAPARTMENTS_MODE
Graficos de la variable: NONLIVINGAREA_MODE
Graficos de la variable: APARTMENTS_MEDI
Graficos de la variable: BASEMENTAREA_MEDI
Graficos de la variable: YEARS_BEGINEXPLUATATION_MEDI
Graficos de la variable: YEARS_BUILD_MEDI
Graficos de la variable: COMMONAREA_MEDI
Graficos de la variable: ELEVATORS_MEDI
Graficos de la variable: ENTRANCES_MEDI
Graficos de la variable: FLOORSMAX_MEDI
Graficos de la variable: FLOORSMIN_MEDI
Graficos de la variable: LANDAREA_MEDI
Graficos de la variable: LIVINGAPARTMENTS_MEDI
Graficos de la variable: LIVINGAREA_MEDI
Graficos de la variable: NONLIVINGAPARTMENTS_MEDI
Graficos de la variable: NONLIVINGAREA_MEDI
Graficos de la variable: FONDKAPREMONT_MODE
Graficos de la variable: HOUSETYPE_MODE
Graficos de la variable: TOTALAREA_MODE
Graficos de la variable: WALLSMATERIAL_MODE
Graficos de la variable: EMERGENCYSTATE_MODE
Graficos de la variable: OBS_30_CNT_SOCIAL_CIRCLE
Graficos de la variable: DEF_30_CNT_SOCIAL_CIRCLE
Graficos de la variable: OBS_60_CNT_SOCIAL_CIRCLE
Graficos de la variable: DEF_60_CNT_SOCIAL_CIRCLE
Graficos de la variable: DAYS_LAST_PHONE_CHANGE
Graficos de la variable: FLAG_DOCUMENT_2
Graficos de la variable: FLAG_DOCUMENT_3
Graficos de la variable: FLAG_DOCUMENT_4
Graficos de la variable: FLAG_DOCUMENT_5
Graficos de la variable: FLAG_DOCUMENT_6
Graficos de la variable: FLAG_DOCUMENT_7
Graficos de la variable: FLAG_DOCUMENT_8
Graficos de la variable: FLAG_DOCUMENT_9
Graficos de la variable: FLAG_DOCUMENT_10
Graficos de la variable: FLAG_DOCUMENT_11
Graficos de la variable: FLAG_DOCUMENT_12
Graficos de la variable: FLAG_DOCUMENT_13
Graficos de la variable: FLAG_DOCUMENT_14
Graficos de la variable: FLAG_DOCUMENT_15
Graficos de la variable: FLAG_DOCUMENT_16
Graficos de la variable: FLAG_DOCUMENT_17
Graficos de la variable: FLAG_DOCUMENT_18
Graficos de la variable: FLAG_DOCUMENT_19
Graficos de la variable: FLAG_DOCUMENT_20
Graficos de la variable: FLAG_DOCUMENT_21
Graficos de la variable: AMT_REQ_CREDIT_BUREAU_HOUR
Graficos de la variable: AMT_REQ_CREDIT_BUREAU_DAY
Graficos de la variable: AMT_REQ_CREDIT_BUREAU_WEEK
Graficos de la variable: AMT_REQ_CREDIT_BUREAU_MON
Graficos de la variable: AMT_REQ_CREDIT_BUREAU_QRT
Graficos de la variable: AMT_REQ_CREDIT_BUREAU_YEAR
Graficos de la variable: NWEEKDAY_PROCESS_START
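The type-based dispatch used in the loop above can be sketched on a toy frame without any plotting. The column names and the `continuous` / `discrete` lists are hypothetical; `is_float_dtype` is used here as an alternative to comparing the dtype with `float`.

```python
import pandas as pd
from pandas.api.types import is_float_dtype

# Toy frame (assumption): one float, one integer flag, one categorical column
toy = pd.DataFrame({
    "AMT": [1.5, 2.0, 3.25],
    "FLAG": [0, 1, 0],
    "KIND": pd.Categorical(["a", "b", "a"]),
    "TARGET": [0, 1, 0],
})

continuous, discrete = [], []
for col in toy.columns.drop("TARGET"):
    # Float columns are treated as continuous, everything else as categorical/boolean
    if is_float_dtype(toy[col]):
        continuous.append(col)
    else:
        discrete.append(col)

print("continuous:", continuous)
print("discrete:", discrete)
```

One consequence of this rule is that count-like columns stored as float64, such as CNT_FAM_MEMBERS or HOUR_APPR_PROCESS_START in the dtype listing, are plotted as continuous.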
Conclusions from the plots¶
The plots for these 120 variables show how each variable behaves on its own and in relation to the target variable. It is useful to work from the particular to the general. At the individual level, gender is one example: men show a higher loan repayment rate than women. For education, a higher level of education goes with a stronger tendency to repay the loan (TARGET = 0). For age, older applicants are more likely to repay the loan, which shows in the clear differences between the interquartile ranges of the boxplot.¶
More generally, the type of job and the organization where the applicant works stand out. People who work in formal, well-established settings, such as large companies, are more likely to repay the loan on time and in full. In contrast, those in trades or less specialized jobs, such as low-skill laborers, waiters and drivers, tend to have a lower rate of on-time payment.¶
Finally, some variables appear to be decisive for the model, such as income, references from other banks, the situation of the applicant's social circle and the place of residence. These factors indicate the economic capacity and stability of applicants, which directly affects their ability to meet loan payments.¶
Treatment of continuous variables¶
This section covers missing values, correlations between the continuous variables, and outliers.¶
In [12]:
lista_var_con
Out[12]:
['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START']
The function get_deviation_of_mean_perc( ) determines what proportion of each continuous variable falls outside a confidence-style interval centered on the mean, whose half-width is the standard deviation multiplied by the factor multiplier. Here the function returns the number and share of values outside that range, and shows how these extreme values are distributed across the target variable.¶
In [13]:
f_aux.get_deviation_of_mean_perc(df_train, lista_var_con, target = 'TARGET', multiplier = 3)
Out[13]:
| | variable | 0 | 1 | sum_outlier_values | porcentaje_sum_null_values |
|---|---|---|---|---|---|
| 0 | AMT_INCOME_TOTAL | 0.947115 | 0.052885 | 208 | 0.000846 |
| 1 | AMT_CREDIT | 0.958763 | 0.041237 | 2619 | 0.010646 |
| 2 | AMT_ANNUITY | 0.963606 | 0.036394 | 2363 | 0.009605 |
| 3 | AMT_GOODS_PRICE | 0.962963 | 0.037037 | 3321 | 0.013500 |
| 4 | REGION_POPULATION_RELATIVE | 0.960321 | 0.039679 | 6729 | 0.027353 |
| 5 | DAYS_REGISTRATION | 0.957586 | 0.042414 | 613 | 0.002492 |
| 6 | OWN_CAR_AGE | 0.915541 | 0.084459 | 2664 | 0.010829 |
| 7 | CNT_FAM_MEMBERS | 0.902377 | 0.097623 | 3155 | 0.012825 |
| 8 | APARTMENTS_AVG | 0.949831 | 0.050169 | 2372 | 0.009642 |
| 9 | BASEMENTAREA_AVG | 0.948604 | 0.051396 | 1576 | 0.006406 |
| 10 | YEARS_BEGINEXPLUATATION_AVG | 0.906526 | 0.093474 | 567 | 0.002305 |
| 11 | YEARS_BUILD_AVG | 0.927597 | 0.072403 | 953 | 0.003874 |
| 12 | COMMONAREA_AVG | 0.941691 | 0.058309 | 1372 | 0.005577 |
| 13 | ELEVATORS_AVG | 0.955647 | 0.044353 | 1939 | 0.007882 |
| 14 | ENTRANCES_AVG | 0.939684 | 0.060316 | 1774 | 0.007211 |
| 15 | FLOORSMAX_AVG | 0.957046 | 0.042954 | 2072 | 0.008422 |
| 16 | FLOORSMIN_AVG | 0.960870 | 0.039130 | 460 | 0.001870 |
| 17 | LANDAREA_AVG | 0.933374 | 0.066626 | 1651 | 0.006711 |
| 18 | LIVINGAPARTMENTS_AVG | 0.948958 | 0.051042 | 1391 | 0.005654 |
| 19 | LIVINGAREA_AVG | 0.948134 | 0.051866 | 2545 | 0.010345 |
| 20 | NONLIVINGAPARTMENTS_AVG | 0.929174 | 0.070826 | 593 | 0.002410 |
| 21 | NONLIVINGAREA_AVG | 0.946875 | 0.053125 | 1920 | 0.007805 |
| 22 | APARTMENTS_MODE | 0.950021 | 0.049979 | 2401 | 0.009760 |
| 23 | BASEMENTAREA_MODE | 0.946789 | 0.053211 | 1635 | 0.006646 |
| 24 | YEARS_BEGINEXPLUATATION_MODE | 0.904676 | 0.095324 | 556 | 0.002260 |
| 25 | YEARS_BUILD_MODE | 0.928423 | 0.071577 | 964 | 0.003919 |
| 26 | COMMONAREA_MODE | 0.938462 | 0.061538 | 1365 | 0.005549 |
| 27 | ELEVATORS_MODE | 0.952078 | 0.047922 | 2671 | 0.010857 |
| 28 | ENTRANCES_MODE | 0.938601 | 0.061399 | 1759 | 0.007150 |
| 29 | FLOORSMAX_MODE | 0.958591 | 0.041409 | 2101 | 0.008540 |
| 30 | FLOORSMIN_MODE | 0.963061 | 0.036939 | 379 | 0.001541 |
| 31 | LANDAREA_MODE | 0.932749 | 0.067251 | 1710 | 0.006951 |
| 32 | LIVINGAPARTMENTS_MODE | 0.946191 | 0.053809 | 1431 | 0.005817 |
| 33 | LIVINGAREA_MODE | 0.948134 | 0.051866 | 2680 | 0.010894 |
| 34 | NONLIVINGAPARTMENTS_MODE | 0.921429 | 0.078571 | 560 | 0.002276 |
| 35 | NONLIVINGAREA_MODE | 0.947773 | 0.052227 | 1953 | 0.007939 |
| 36 | APARTMENTS_MEDI | 0.949938 | 0.050062 | 2417 | 0.009825 |
| 37 | BASEMENTAREA_MEDI | 0.949057 | 0.050943 | 1590 | 0.006463 |
| 38 | YEARS_BEGINEXPLUATATION_MEDI | 0.902985 | 0.097015 | 536 | 0.002179 |
| 39 | YEARS_BUILD_MEDI | 0.928200 | 0.071800 | 961 | 0.003906 |
| 40 | COMMONAREA_MEDI | 0.940374 | 0.059626 | 1392 | 0.005658 |
| 41 | ELEVATORS_MEDI | 0.954969 | 0.045031 | 1932 | 0.007853 |
| 42 | ENTRANCES_MEDI | 0.938833 | 0.061167 | 1782 | 0.007244 |
| 43 | FLOORSMAX_MEDI | 0.956861 | 0.043139 | 2179 | 0.008857 |
| 44 | FLOORSMIN_MEDI | 0.960648 | 0.039352 | 432 | 0.001756 |
| 45 | LANDAREA_MEDI | 0.935807 | 0.064193 | 1698 | 0.006902 |
| 46 | LIVINGAPARTMENTS_MEDI | 0.947745 | 0.052255 | 1397 | 0.005679 |
| 47 | LIVINGAREA_MEDI | 0.949495 | 0.050505 | 2574 | 0.010463 |
| 48 | NONLIVINGAPARTMENTS_MEDI | 0.926995 | 0.073005 | 589 | 0.002394 |
| 49 | NONLIVINGAREA_MEDI | 0.947152 | 0.052848 | 1949 | 0.007923 |
| 50 | TOTALAREA_MODE | 0.956032 | 0.043968 | 2661 | 0.010817 |
| 51 | OBS_30_CNT_SOCIAL_CIRCLE | 0.907786 | 0.092214 | 4945 | 0.020101 |
| 52 | DEF_30_CNT_SOCIAL_CIRCLE | 0.881830 | 0.118170 | 5509 | 0.022394 |
| 53 | OBS_60_CNT_SOCIAL_CIRCLE | 0.907311 | 0.092689 | 4801 | 0.019516 |
| 54 | DEF_60_CNT_SOCIAL_CIRCLE | 0.872681 | 0.127319 | 3126 | 0.012707 |
| 55 | DAYS_LAST_PHONE_CHANGE | 0.961847 | 0.038153 | 498 | 0.002024 |
| 56 | AMT_REQ_CREDIT_BUREAU_HOUR | 0.918750 | 0.081250 | 1280 | 0.005203 |
| 57 | AMT_REQ_CREDIT_BUREAU_DAY | 0.902813 | 0.097187 | 1173 | 0.004768 |
| 58 | AMT_REQ_CREDIT_BUREAU_WEEK | 0.921424 | 0.078576 | 6796 | 0.027625 |
| 59 | AMT_REQ_CREDIT_BUREAU_MON | 0.947531 | 0.052469 | 2592 | 0.010536 |
| 60 | AMT_REQ_CREDIT_BUREAU_QRT | 0.916026 | 0.083974 | 1822 | 0.007406 |
| 61 | AMT_REQ_CREDIT_BUREAU_YEAR | 0.909022 | 0.090978 | 2649 | 0.010768 |
| 62 | HOUR_APPR_PROCESS_START | 0.898167 | 0.101833 | 491 | 0.001996 |
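The source of get_deviation_of_mean_perc( ) is not included here. As a rough sketch of the idea described above, the hypothetical helper below counts the values outside mean ± multiplier × standard deviation for one column and reports how they split across the target. The function name, the returned keys and the toy data are assumptions.

```python
import pandas as pd


def outliers_by_target(df, col, target="TARGET", multiplier=3):
    """Summarize values of `col` outside mean +/- multiplier * std.

    Hypothetical helper, not the notebook's f_aux implementation.
    """
    mean, std = df[col].mean(), df[col].std()
    lower, upper = mean - multiplier * std, mean + multiplier * std
    # Missing values compare as False on both sides, so they are not counted
    outside = df[(df[col] < lower) | (df[col] > upper)]
    shares = outside[target].value_counts(normalize=True)
    return {
        "variable": col,
        "n_outliers": len(outside),
        "share_of_rows": len(outside) / len(df),
        "target_0": float(shares.get(0, 0.0)),
        "target_1": float(shares.get(1, 0.0)),
    }


# Toy data (assumption): 98 ordinary values and 2 extreme ones
toy = pd.DataFrame({
    "x": [0.0] * 98 + [100.0, 100.0],
    "TARGET": [0] * 98 + [0, 1],
})

summary = outliers_by_target(toy, "x", multiplier=3)
print(summary)
```

Note that the mean and standard deviation are themselves pulled by extreme values, so this rule can under-count outliers in heavily skewed columns.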
Conclusions on the continuous variables in relation to the target variable¶
A variable with more values outside the interval has more widely dispersed data. Such variables may be more relevant to the bank's risk assessment, since they point to more diverse applicant profiles. One example is CNT_FAM_MEMBERS, with 3,155 values outside the interval, which indicates greater heterogeneity in family size and is relevant for assessing risk in connection with loan repayment.¶
By contrast, a variable with few values outside the interval, such as AMT_INCOME_TOTAL with only 208 outliers, suggests that applicants have similar incomes and therefore a more homogeneous profile on that variable. This analysis helps identify key variables for building general applicant profiles.¶
Correlation plot¶
In [14]:
f_aux.get_corr_matrix(dataset = df_train[lista_var_con], metodo = 'pearson', size_figure = [10,8])
Out[14]:
0
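The next cell lists highly correlated pairs by keeping only the lower triangle of the absolute correlation matrix. A minimal sketch of that technique on synthetic data follows; the toy columns and the boolean-mask variant with `DataFrame.where` are assumptions, chosen so that each pair appears once and the diagonal is excluded.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                  # independent noise
})

corr = toy.corr(method="pearson").abs()

# True strictly below the main diagonal, False elsewhere
mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)

pairs = (
    corr.where(mask)          # upper triangle and diagonal become NaN
    .stack()                  # long format: one row per (row, column) pair
    .rename("correlation")
    .reset_index()            # unnamed levels become level_0 and level_1
    .sort_values("correlation", ascending=False)
)

# NaN never passes the comparison, so masked entries are dropped here
high_pairs = pairs[pairs["correlation"] > 0.6]
print(high_pairs)
```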
In [15]:
corr = df_train[lista_var_con].corr('pearson')
new_corr = corr.abs()
new_corr.loc[:,:] = np.tril(new_corr, k=-1) # keep only the strictly lower triangle (below the main diagonal); the rest becomes 0
new_corr = new_corr.stack().to_frame('correlation').reset_index().sort_values(by='correlation', ascending=False)
new_corr[new_corr['correlation']> 0.6]
Out[15]:
| | level_0 | level_1 | correlation |
|---|---|---|---|
| 3918 | OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | 0.998514 |
| 2912 | YEARS_BUILD_MEDI | YEARS_BUILD_AVG | 0.998391 |
| 3262 | FLOORSMIN_MEDI | FLOORSMIN_AVG | 0.997322 |
| 3192 | FLOORSMAX_MEDI | FLOORSMAX_AVG | 0.996983 |
| 3122 | ENTRANCES_MEDI | ENTRANCES_AVG | 0.996911 |
| 3052 | ELEVATORS_MEDI | ELEVATORS_AVG | 0.996319 |
| 2982 | COMMONAREA_MEDI | COMMONAREA_AVG | 0.995660 |
| 3472 | LIVINGAREA_MEDI | LIVINGAREA_AVG | 0.995472 |
| 2702 | APARTMENTS_MEDI | APARTMENTS_AVG | 0.995430 |
| 2772 | BASEMENTAREA_MEDI | BASEMENTAREA_AVG | 0.994335 |
| 2842 | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_AVG | 0.994314 |
| 3402 | LIVINGAPARTMENTS_MEDI | LIVINGAPARTMENTS_AVG | 0.993621 |
| 3612 | NONLIVINGAREA_MEDI | NONLIVINGAREA_AVG | 0.991197 |
| 3332 | LANDAREA_MEDI | LANDAREA_AVG | 0.991056 |
| 1946 | YEARS_BUILD_MODE | YEARS_BUILD_AVG | 0.989372 |
| 2926 | YEARS_BUILD_MEDI | YEARS_BUILD_MODE | 0.989272 |
| 3542 | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAPARTMENTS_AVG | 0.989047 |
| 3276 | FLOORSMIN_MEDI | FLOORSMIN_MODE | 0.988735 |
| 3206 | FLOORSMAX_MEDI | FLOORSMAX_MODE | 0.988205 |
| 208 | AMT_GOODS_PRICE | AMT_CREDIT | 0.987000 |
| 2296 | FLOORSMIN_MODE | FLOORSMIN_AVG | 0.986250 |
| 2226 | FLOORSMAX_MODE | FLOORSMAX_AVG | 0.985561 |
| 3066 | ELEVATORS_MEDI | ELEVATORS_MODE | 0.982819 |
| 3346 | LANDAREA_MEDI | LANDAREA_MODE | 0.981517 |
| 3556 | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAPARTMENTS_MODE | 0.981259 |
| 3136 | ENTRANCES_MEDI | ENTRANCES_MODE | 0.981012 |
| 2086 | ELEVATORS_MODE | ELEVATORS_AVG | 0.979161 |
| 2996 | COMMONAREA_MEDI | COMMONAREA_MODE | 0.978934 |
| 2156 | ENTRANCES_MODE | ENTRANCES_AVG | 0.978034 |
| 2786 | BASEMENTAREA_MEDI | BASEMENTAREA_MODE | 0.977787 |
| 2716 | APARTMENTS_MEDI | APARTMENTS_MODE | 0.977514 |
| 3626 | NONLIVINGAREA_MEDI | NONLIVINGAREA_MODE | 0.976066 |
| 2016 | COMMONAREA_MODE | COMMONAREA_AVG | 0.975988 |
| 3486 | LIVINGAREA_MEDI | LIVINGAREA_MODE | 0.975391 |
| 3416 | LIVINGAPARTMENTS_MEDI | LIVINGAPARTMENTS_MODE | 0.975138 |
| 1736 | APARTMENTS_MODE | APARTMENTS_AVG | 0.974062 |
| 1806 | BASEMENTAREA_MODE | BASEMENTAREA_AVG | 0.973389 |
| 1876 | YEARS_BEGINEXPLUATATION_MODE | YEARS_BEGINEXPLUATATION_AVG | 0.973181 |
| 2366 | LANDAREA_MODE | LANDAREA_AVG | 0.973105 |
| 2506 | LIVINGAREA_MODE | LIVINGAREA_AVG | 0.972434 |
| 2576 | NONLIVINGAPARTMENTS_MODE | NONLIVINGAPARTMENTS_AVG | 0.970068 |
| 2436 | LIVINGAPARTMENTS_MODE | LIVINGAPARTMENTS_AVG | 0.969449 |
| 2646 | NONLIVINGAREA_MODE | NONLIVINGAREA_AVG | 0.967162 |
| 2856 | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_MODE | 0.966567 |
| 1460 | LIVINGAPARTMENTS_AVG | APARTMENTS_AVG | 0.945033 |
| 3420 | LIVINGAPARTMENTS_MEDI | APARTMENTS_MEDI | 0.943933 |
| 3392 | LIVINGAPARTMENTS_MEDI | APARTMENTS_AVG | 0.943237 |
| 2440 | LIVINGAPARTMENTS_MODE | APARTMENTS_MODE | 0.940871 |
| 2712 | APARTMENTS_MEDI | LIVINGAPARTMENTS_AVG | 0.936922 |
| 2726 | APARTMENTS_MEDI | LIVINGAPARTMENTS_MODE | 0.934145 |
| 2426 | LIVINGAPARTMENTS_MODE | APARTMENTS_AVG | 0.933212 |
| 3679 | TOTALAREA_MODE | LIVINGAREA_AVG | 0.925936 |
| 3707 | TOTALAREA_MODE | LIVINGAREA_MEDI | 0.920434 |
| 3406 | LIVINGAPARTMENTS_MEDI | APARTMENTS_MODE | 0.916930 |
| 3489 | LIVINGAREA_MEDI | APARTMENTS_MEDI | 0.916000 |
| 1529 | LIVINGAREA_AVG | APARTMENTS_AVG | 0.913769 |
| 3461 | LIVINGAREA_MEDI | APARTMENTS_AVG | 0.912728 |
| 2713 | APARTMENTS_MEDI | LIVINGAREA_AVG | 0.912434 |
| 2509 | LIVINGAREA_MODE | APARTMENTS_MODE | 0.910780 |
| 1746 | APARTMENTS_MODE | LIVINGAPARTMENTS_AVG | 0.910617 |
| 3693 | TOTALAREA_MODE | LIVINGAREA_MODE | 0.900830 |
| 2727 | APARTMENTS_MEDI | LIVINGAREA_MODE | 0.897113 |
| 2495 | LIVINGAREA_MODE | APARTMENTS_AVG | 0.894706 |
| 3668 | TOTALAREA_MODE | APARTMENTS_AVG | 0.894517 |
| 3475 | LIVINGAREA_MEDI | APARTMENTS_MODE | 0.894490 |
| 1747 | APARTMENTS_MODE | LIVINGAREA_AVG | 0.891087 |
| 3696 | TOTALAREA_MODE | APARTMENTS_MEDI | 0.888248 |
| 3499 | LIVINGAREA_MEDI | LIVINGAPARTMENTS_MEDI | 0.885952 |
| 3403 | LIVINGAPARTMENTS_MEDI | LIVINGAREA_AVG | 0.884223 |
| 2519 | LIVINGAREA_MODE | LIVINGAPARTMENTS_MODE | 0.881550 |
| 1539 | LIVINGAREA_AVG | LIVINGAPARTMENTS_AVG | 0.881535 |
| 3471 | LIVINGAREA_MEDI | LIVINGAPARTMENTS_AVG | 0.879412 |
| 3485 | LIVINGAREA_MEDI | LIVINGAPARTMENTS_MODE | 0.876144 |
| 2437 | LIVINGAPARTMENTS_MODE | LIVINGAREA_AVG | 0.874517 |
| 3494 | LIVINGAREA_MEDI | ELEVATORS_MEDI | 0.868383 |
| 1534 | LIVINGAREA_AVG | ELEVATORS_AVG | 0.867590 |
| 3466 | LIVINGAREA_MEDI | ELEVATORS_AVG | 0.865807 |
| 3682 | TOTALAREA_MODE | APARTMENTS_MODE | 0.865652 |
| 3058 | ELEVATORS_MEDI | LIVINGAREA_AVG | 0.865594 |
| 3417 | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MODE | 0.860090 |
| 3988 | DEF_60_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | 0.859088 |
| 2514 | LIVINGAREA_MODE | ELEVATORS_MODE | 0.856213 |
| 3480 | LIVINGAREA_MEDI | ELEVATORS_MODE | 0.856028 |
| 2505 | LIVINGAREA_MODE | LIVINGAPARTMENTS_AVG | 0.854263 |
| 2092 | ELEVATORS_MODE | LIVINGAREA_AVG | 0.852650 |
| 3678 | TOTALAREA_MODE | LIVINGAPARTMENTS_AVG | 0.851677 |
| 3706 | TOTALAREA_MODE | LIVINGAPARTMENTS_MEDI | 0.849999 |
| 3673 | TOTALAREA_MODE | ELEVATORS_AVG | 0.846518 |
| 3072 | ELEVATORS_MEDI | LIVINGAREA_MODE | 0.841594 |
| 3701 | TOTALAREA_MODE | ELEVATORS_MEDI | 0.840034 |
| 2500 | LIVINGAREA_MODE | ELEVATORS_AVG | 0.839674 |
| 3692 | TOTALAREA_MODE | LIVINGAPARTMENTS_MODE | 0.838990 |
| 3075 | ELEVATORS_MEDI | APARTMENTS_MEDI | 0.838494 |
| 1115 | ELEVATORS_AVG | APARTMENTS_AVG | 0.838347 |
| 3047 | ELEVATORS_MEDI | APARTMENTS_AVG | 0.836446 |
| 2707 | APARTMENTS_MEDI | ELEVATORS_AVG | 0.835813 |
| 2095 | ELEVATORS_MODE | APARTMENTS_MODE | 0.827446 |
| 2721 | APARTMENTS_MEDI | ELEVATORS_MODE | 0.826966 |
| 2081 | ELEVATORS_MODE | APARTMENTS_AVG | 0.824274 |
| 3687 | TOTALAREA_MODE | ELEVATORS_MODE | 0.822899 |
| 3425 | LIVINGAPARTMENTS_MEDI | ELEVATORS_MEDI | 0.815880 |
| 3397 | LIVINGAPARTMENTS_MEDI | ELEVATORS_AVG | 0.814409 |
| 1465 | LIVINGAPARTMENTS_AVG | ELEVATORS_AVG | 0.813318 |
| 3057 | ELEVATORS_MEDI | LIVINGAPARTMENTS_AVG | 0.810824 |
| 2445 | LIVINGAPARTMENTS_MODE | ELEVATORS_MODE | 0.810610 |
| 3061 | ELEVATORS_MEDI | APARTMENTS_MODE | 0.810200 |
| 1741 | APARTMENTS_MODE | ELEVATORS_AVG | 0.807742 |
| 3411 | LIVINGAPARTMENTS_MEDI | ELEVATORS_MODE | 0.802015 |
| 3071 | ELEVATORS_MEDI | LIVINGAPARTMENTS_MODE | 0.800567 |
| 2431 | LIVINGAPARTMENTS_MODE | ELEVATORS_AVG | 0.799062 |
| 2091 | ELEVATORS_MODE | LIVINGAPARTMENTS_AVG | 0.796596 |
| 209 | AMT_GOODS_PRICE | AMT_ANNUITY | 0.775310 |
| 139 | AMT_ANNUITY | AMT_CREDIT | 0.770163 |
| 1329 | FLOORSMIN_AVG | FLOORSMAX_AVG | 0.739772 |
| 3289 | FLOORSMIN_MEDI | FLOORSMAX_MEDI | 0.737659 |
| 3261 | FLOORSMIN_MEDI | FLOORSMAX_AVG | 0.737192 |
| 3193 | FLOORSMAX_MEDI | FLOORSMIN_AVG | 0.737175 |
| 3275 | FLOORSMIN_MEDI | FLOORSMAX_MODE | 0.727059 |
| 2227 | FLOORSMAX_MODE | FLOORSMIN_AVG | 0.726317 |
| 2309 | FLOORSMIN_MODE | FLOORSMAX_MODE | 0.723685 |
| 3207 | FLOORSMAX_MEDI | FLOORSMIN_MODE | 0.720675 |
| 2295 | FLOORSMIN_MODE | FLOORSMAX_AVG | 0.720125 |
| 1530 | LIVINGAREA_AVG | BASEMENTAREA_AVG | 0.692715 |
| 2782 | BASEMENTAREA_MEDI | LIVINGAREA_AVG | 0.692455 |
| 3490 | LIVINGAREA_MEDI | BASEMENTAREA_MEDI | 0.691733 |
| 2510 | LIVINGAREA_MODE | BASEMENTAREA_MODE | 0.690915 |
| 3462 | LIVINGAREA_MEDI | BASEMENTAREA_AVG | 0.689655 |
| 2796 | BASEMENTAREA_MEDI | LIVINGAREA_MODE | 0.680955 |
| 2799 | BASEMENTAREA_MEDI | APARTMENTS_MEDI | 0.680276 |
| 1258 | FLOORSMAX_AVG | ELEVATORS_AVG | 0.680109 |
| 839 | BASEMENTAREA_AVG | APARTMENTS_AVG | 0.679130 |
| 2771 | BASEMENTAREA_MEDI | APARTMENTS_AVG | 0.678835 |
| 1819 | BASEMENTAREA_MODE | APARTMENTS_MODE | 0.678681 |
| 2703 | APARTMENTS_MEDI | BASEMENTAREA_AVG | 0.677920 |
| 3190 | FLOORSMAX_MEDI | ELEVATORS_AVG | 0.677855 |
| 2496 | LIVINGAREA_MODE | BASEMENTAREA_AVG | 0.677536 |
| 3054 | ELEVATORS_MEDI | FLOORSMAX_AVG | 0.676167 |
| 3218 | FLOORSMAX_MEDI | ELEVATORS_MEDI | 0.675603 |
| 1816 | BASEMENTAREA_MODE | LIVINGAREA_AVG | 0.674590 |
| 3476 | LIVINGAREA_MEDI | BASEMENTAREA_MODE | 0.674313 |
| 3669 | TOTALAREA_MODE | BASEMENTAREA_AVG | 0.672990 |
| 2224 | FLOORSMAX_MODE | ELEVATORS_AVG | 0.671161 |
| 3697 | TOTALAREA_MODE | BASEMENTAREA_MEDI | 0.670219 |
| 3068 | ELEVATORS_MEDI | FLOORSMAX_MODE | 0.669194 |
| 2785 | BASEMENTAREA_MEDI | APARTMENTS_MODE | 0.668579 |
| 1737 | APARTMENTS_MODE | BASEMENTAREA_AVG | 0.665837 |
| 2717 | APARTMENTS_MEDI | BASEMENTAREA_MODE | 0.664087 |
| 1805 | BASEMENTAREA_MODE | APARTMENTS_AVG | 0.661715 |
| 2238 | FLOORSMAX_MODE | ELEVATORS_MODE | 0.660813 |
| 2441 | LIVINGAPARTMENTS_MODE | BASEMENTAREA_MODE | 0.657202 |
| 2088 | ELEVATORS_MODE | FLOORSMAX_AVG | 0.655993 |
| 3204 | FLOORSMAX_MEDI | ELEVATORS_MODE | 0.655301 |
| 3421 | LIVINGAPARTMENTS_MEDI | BASEMENTAREA_MEDI | 0.654280 |
| 2795 | BASEMENTAREA_MEDI | LIVINGAPARTMENTS_MODE | 0.654117 |
| 2165 | ENTRANCES_MODE | BASEMENTAREA_MODE | 0.653678 |
| 1811 | BASEMENTAREA_MODE | ENTRANCES_AVG | 0.653412 |
| 2777 | BASEMENTAREA_MEDI | ENTRANCES_AVG | 0.652667 |
| 3131 | ENTRANCES_MEDI | BASEMENTAREA_MODE | 0.652286 |
| 3393 | LIVINGAPARTMENTS_MEDI | BASEMENTAREA_AVG | 0.651640 |
| 3145 | ENTRANCES_MEDI | BASEMENTAREA_MEDI | 0.651115 |
| 1185 | ENTRANCES_AVG | BASEMENTAREA_AVG | 0.650657 |
| 3683 | TOTALAREA_MODE | BASEMENTAREA_MODE | 0.650591 |
| 2427 | LIVINGAPARTMENTS_MODE | BASEMENTAREA_AVG | 0.649819 |
| 2781 | BASEMENTAREA_MEDI | LIVINGAPARTMENTS_AVG | 0.649458 |
| 1461 | LIVINGAPARTMENTS_AVG | BASEMENTAREA_AVG | 0.649358 |
| 3117 | ENTRANCES_MEDI | BASEMENTAREA_AVG | 0.646530 |
| 3675 | TOTALAREA_MODE | FLOORSMAX_AVG | 0.633699 |
| 3407 | LIVINGAPARTMENTS_MEDI | BASEMENTAREA_MODE | 0.632946 |
| 3703 | TOTALAREA_MODE | FLOORSMAX_MEDI | 0.631241 |
| 2791 | BASEMENTAREA_MEDI | ENTRANCES_MODE | 0.630842 |
| 1536 | LIVINGAREA_AVG | FLOORSMAX_AVG | 0.629838 |
| 1815 | BASEMENTAREA_MODE | LIVINGAPARTMENTS_AVG | 0.628019 |
| 3196 | FLOORSMAX_MEDI | LIVINGAREA_AVG | 0.627858 |
| 3689 | TOTALAREA_MODE | FLOORSMAX_MODE | 0.626663 |
| 3468 | LIVINGAREA_MEDI | FLOORSMAX_AVG | 0.626568 |
| 2151 | ENTRANCES_MODE | BASEMENTAREA_AVG | 0.625835 |
| 3496 | LIVINGAREA_MEDI | FLOORSMAX_MEDI | 0.625791 |
| 2230 | FLOORSMAX_MODE | LIVINGAREA_AVG | 0.625579 |
| 3482 | LIVINGAREA_MEDI | FLOORSMAX_MODE | 0.623980 |
| 2501 | LIVINGAREA_MODE | ENTRANCES_AVG | 0.623472 |
| 2515 | LIVINGAREA_MODE | ENTRANCES_MODE | 0.623182 |
| 3141 | ENTRANCES_MEDI | LIVINGAREA_MODE | 0.622801 |
| 3467 | LIVINGAREA_MEDI | ENTRANCES_AVG | 0.619999 |
| 3495 | LIVINGAREA_MEDI | ENTRANCES_MEDI | 0.619562 |
| 1253 | FLOORSMAX_AVG | APARTMENTS_AVG | 0.619371 |
| 1535 | LIVINGAREA_AVG | ENTRANCES_AVG | 0.619216 |
| 3185 | FLOORSMAX_MEDI | APARTMENTS_AVG | 0.617271 |
| 2709 | APARTMENTS_MEDI | FLOORSMAX_AVG | 0.615685 |
| 3127 | ENTRANCES_MEDI | LIVINGAREA_AVG | 0.615367 |
| 2219 | FLOORSMAX_MODE | APARTMENTS_AVG | 0.615358 |
| 2164 | ENTRANCES_MODE | APARTMENTS_MODE | 0.615096 |
| 3213 | FLOORSMAX_MEDI | APARTMENTS_MEDI | 0.614799 |
| 4619 | DAYS_EMPLOYED | DAYS_BIRTH | 0.614650 |
| 2723 | APARTMENTS_MEDI | FLOORSMAX_MODE | 0.613281 |
| 3130 | ENTRANCES_MEDI | APARTMENTS_MODE | 0.611987 |
| 1742 | APARTMENTS_MODE | ENTRANCES_AVG | 0.611935 |
| 2708 | APARTMENTS_MEDI | ENTRANCES_AVG | 0.610877 |
| 1184 | ENTRANCES_AVG | APARTMENTS_AVG | 0.610653 |
| 3144 | ENTRANCES_MEDI | APARTMENTS_MEDI | 0.610636 |
| 3116 | ENTRANCES_MEDI | APARTMENTS_AVG | 0.606969 |
| 2516 | LIVINGAREA_MODE | FLOORSMAX_MODE | 0.605667 |
| 4493 | DAYS_BIRTH | EXT_SOURCE_1 | 0.601112 |
Conclusions on the correlation between variables
Some pairs of variables are extremely highly correlated. In many cases they describe the same underlying feature and differ only in the summary statistic, for example YEARS_BUILD_MEDI and YEARS_BUILD_AVG. Redundant variables hurt the stability and interpretability of a predictive model, so they need to be identified and removed.
Likewise, the variables related to income, credit and employment include pairs with a direct proportional relationship, such as AMT_ANNUITY and AMT_CREDIT: as the requested credit amount increases, so does the annuity. As in the previous section, where values were placed within a confidence interval, this helps identify more complex patterns in applicant profiles.
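One way to act on this conclusion is to flag, for each highly correlated pair, the later of the two columns as a removal candidate. The sketch below is only an illustration on toy data: the helper name `columnas_redundantes` and the 0.9 threshold are assumptions, not part of the notebook or of `funciones_auxiliares`.

```python
import numpy as np
import pandas as pd


def columnas_redundantes(df: pd.DataFrame, umbral: float = 0.9) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds `umbral`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so that each pair is examined once.
    mascara = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    superior = corr.where(mascara)
    return [col for col in superior.columns if (superior[col] > umbral).any()]


# Toy data (hypothetical): "b" is almost a copy of "a", "c" is independent noise.
rng = np.random.default_rng(42)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
a_eliminar = columnas_redundantes(demo, umbral=0.9)
print(a_eliminar)
```

Applied to `df_train[lista_var_con]`, the same helper would list candidates such as the `_MEDI` and `_MODE` duplicates of the `_AVG` variables; which member of each pair to keep remains a modelling decision.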
Null-value treatment (continuous variables)
In [16]:
lista_var_con
Out[16]:
['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START']
In [17]:
f_aux.get_percent_null_values_target(pd_loan = df_train, list_var_continuous = lista_var_con, target = 'TARGET')
Out[17]:
| | Category_0 | variable | sum_null_values | porcentaje_sum_null_values | Category_1 |
|---|---|---|---|---|---|
| 0 | 1.000000 | AMT_ANNUITY | 10 | 0.000041 | NaN |
| 1 | 0.923077 | AMT_GOODS_PRICE | 221 | 0.000898 | 0.076923 |
| 2 | 0.915163 | OWN_CAR_AGE | 162418 | 0.660214 | 0.084837 |
| 3 | 1.000000 | CNT_FAM_MEMBERS | 2 | 0.000008 | NaN |
| 4 | 0.914752 | EXT_SOURCE_1 | 138595 | 0.563376 | 0.085248 |
| 5 | 0.922787 | EXT_SOURCE_2 | 531 | 0.002158 | 0.077213 |
| 6 | 0.907223 | EXT_SOURCE_3 | 48805 | 0.198388 | 0.092777 |
| 7 | 0.908612 | APARTMENTS_AVG | 124732 | 0.507024 | 0.091388 |
| 8 | 0.911054 | BASEMENTAREA_AVG | 143829 | 0.584652 | 0.088946 |
| 9 | 0.908069 | YEARS_BEGINEXPLUATATION_AVG | 119949 | 0.487582 | 0.091931 |
| 10 | 0.913381 | YEARS_BUILD_AVG | 163543 | 0.664787 | 0.086619 |
| 11 | 0.914441 | COMMONAREA_AVG | 171811 | 0.698396 | 0.085559 |
| 12 | 0.909309 | ELEVATORS_AVG | 131017 | 0.532572 | 0.090691 |
| 13 | 0.908366 | ENTRANCES_AVG | 123775 | 0.503134 | 0.091634 |
| 14 | 0.908191 | FLOORSMAX_AVG | 122297 | 0.497126 | 0.091809 |
| 15 | 0.913863 | FLOORSMIN_AVG | 166921 | 0.678519 | 0.086137 |
| 16 | 0.912066 | LANDAREA_AVG | 145985 | 0.593416 | 0.087934 |
| 17 | 0.913972 | LIVINGAPARTMENTS_AVG | 168119 | 0.683388 | 0.086028 |
| 18 | 0.908725 | LIVINGAREA_AVG | 123462 | 0.501862 | 0.091275 |
| 19 | 0.914273 | NONLIVINGAPARTMENTS_AVG | 170729 | 0.693998 | 0.085727 |
| 20 | 0.909750 | NONLIVINGAREA_AVG | 135624 | 0.551299 | 0.090250 |
| 21 | 0.908612 | APARTMENTS_MODE | 124732 | 0.507024 | 0.091388 |
| 22 | 0.911054 | BASEMENTAREA_MODE | 143829 | 0.584652 | 0.088946 |
| 23 | 0.908069 | YEARS_BEGINEXPLUATATION_MODE | 119949 | 0.487582 | 0.091931 |
| 24 | 0.913381 | YEARS_BUILD_MODE | 163543 | 0.664787 | 0.086619 |
| 25 | 0.914441 | COMMONAREA_MODE | 171811 | 0.698396 | 0.085559 |
| 26 | 0.909309 | ELEVATORS_MODE | 131017 | 0.532572 | 0.090691 |
| 27 | 0.908366 | ENTRANCES_MODE | 123775 | 0.503134 | 0.091634 |
| 28 | 0.908191 | FLOORSMAX_MODE | 122297 | 0.497126 | 0.091809 |
| 29 | 0.913863 | FLOORSMIN_MODE | 166921 | 0.678519 | 0.086137 |
| 30 | 0.912066 | LANDAREA_MODE | 145985 | 0.593416 | 0.087934 |
| 31 | 0.913972 | LIVINGAPARTMENTS_MODE | 168119 | 0.683388 | 0.086028 |
| 32 | 0.908725 | LIVINGAREA_MODE | 123462 | 0.501862 | 0.091275 |
| 33 | 0.914273 | NONLIVINGAPARTMENTS_MODE | 170729 | 0.693998 | 0.085727 |
| 34 | 0.909750 | NONLIVINGAREA_MODE | 135624 | 0.551299 | 0.090250 |
| 35 | 0.908612 | APARTMENTS_MEDI | 124732 | 0.507024 | 0.091388 |
| 36 | 0.911054 | BASEMENTAREA_MEDI | 143829 | 0.584652 | 0.088946 |
| 37 | 0.908069 | YEARS_BEGINEXPLUATATION_MEDI | 119949 | 0.487582 | 0.091931 |
| 38 | 0.913381 | YEARS_BUILD_MEDI | 163543 | 0.664787 | 0.086619 |
| 39 | 0.914441 | COMMONAREA_MEDI | 171811 | 0.698396 | 0.085559 |
| 40 | 0.909309 | ELEVATORS_MEDI | 131017 | 0.532572 | 0.090691 |
| 41 | 0.908366 | ENTRANCES_MEDI | 123775 | 0.503134 | 0.091634 |
| 42 | 0.908191 | FLOORSMAX_MEDI | 122297 | 0.497126 | 0.091809 |
| 43 | 0.913863 | FLOORSMIN_MEDI | 166921 | 0.678519 | 0.086137 |
| 44 | 0.912066 | LANDAREA_MEDI | 145985 | 0.593416 | 0.087934 |
| 45 | 0.913972 | LIVINGAPARTMENTS_MEDI | 168119 | 0.683388 | 0.086028 |
| 46 | 0.908725 | LIVINGAREA_MEDI | 123462 | 0.501862 | 0.091275 |
| 47 | 0.914273 | NONLIVINGAPARTMENTS_MEDI | 170729 | 0.693998 | 0.085727 |
| 48 | 0.909750 | NONLIVINGAREA_MEDI | 135624 | 0.551299 | 0.090250 |
| 49 | 0.907756 | TOTALAREA_MODE | 118707 | 0.482533 | 0.092244 |
| 50 | 0.960543 | OBS_30_CNT_SOCIAL_CIRCLE | 811 | 0.003297 | 0.039457 |
| 51 | 0.960543 | DEF_30_CNT_SOCIAL_CIRCLE | 811 | 0.003297 | 0.039457 |
| 52 | 0.960543 | OBS_60_CNT_SOCIAL_CIRCLE | 811 | 0.003297 | 0.039457 |
| 53 | 0.960543 | DEF_60_CNT_SOCIAL_CIRCLE | 811 | 0.003297 | 0.039457 |
| 54 | 1.000000 | DAYS_LAST_PHONE_CHANGE | 1 | 0.000004 | NaN |
| 55 | 0.896613 | AMT_REQ_CREDIT_BUREAU_HOUR | 33244 | 0.135134 | 0.103387 |
| 56 | 0.896613 | AMT_REQ_CREDIT_BUREAU_DAY | 33244 | 0.135134 | 0.103387 |
| 57 | 0.896613 | AMT_REQ_CREDIT_BUREAU_WEEK | 33244 | 0.135134 | 0.103387 |
| 58 | 0.896613 | AMT_REQ_CREDIT_BUREAU_MON | 33244 | 0.135134 | 0.103387 |
| 59 | 0.896613 | AMT_REQ_CREDIT_BUREAU_QRT | 33244 | 0.135134 | 0.103387 |
| 60 | 0.896613 | AMT_REQ_CREDIT_BUREAU_YEAR | 33244 | 0.135134 | 0.103387 |
Conclusions on the percentage of null values
The analysis above lets us split the variables into two groups: variables to impute and variables to drop. This categorisation improves the quality of the dataset for the predictive models. It is also important to consider what a null value means. Here, a null may indicate that the client did not provide certain documents or required information, so, depending on the variable, missingness can itself be an indicator of higher risk.
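If missingness may carry signal, one option is to record it explicitly before imputing. The following is a minimal sketch under that assumption; the helper `agregar_indicadores_nulos` and the `_ES_NULO` suffix are hypothetical names, and the toy frame only borrows two column names from the dataset.

```python
import numpy as np
import pandas as pd


def agregar_indicadores_nulos(df: pd.DataFrame, columnas: list) -> pd.DataFrame:
    """Return a copy of df with one 0/1 `<col>_ES_NULO` flag per listed column."""
    salida = df.copy()
    for col in columnas:
        # 1 where the original value is missing, 0 otherwise.
        salida[f"{col}_ES_NULO"] = salida[col].isnull().astype(int)
    return salida


# Toy data (hypothetical values).
demo = pd.DataFrame({
    "EXT_SOURCE_1": [0.5, np.nan, 0.7],
    "OWN_CAR_AGE": [np.nan, np.nan, 3.0],
})
con_flags = agregar_indicadores_nulos(demo, ["EXT_SOURCE_1", "OWN_CAR_AGE"])
print(con_flags)
```

The flags survive imputation, so a model can still distinguish an imputed value from an observed one.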
Imputing null values (continuous variables)
Next, two lists are built to impute the missing values in the dataset: one for mean imputation and one for median imputation.
The choice of method was based on the percentage of missing values in each variable. Variables with 30% nulls or fewer were assigned to the mean, on the assumption that their values are fairly evenly distributed and that imputation would not significantly alter the relationships between variables. Mean imputation is appropriate when the data have no influential outliers and follow a symmetric or approximately normal distribution. Note that the hard-coded mean list in the next cell only contains variables with under 1% nulls in the table above; EXT_SOURCE_3 (about 20%) and the AMT_REQ_CREDIT_BUREAU_* variables (about 13.5%) are below the 30% threshold but end up in the median list.
Variables with more than 30% nulls were imputed with the median, which is more robust to outliers and skewed distributions. Imputing them with the mean could distort the analysis when outliers or a skewed distribution are present.
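The 30% rule described above can also be derived from the data instead of typing the column names by hand. This is a sketch only: `separar_por_porcentaje_nulos` is a hypothetical helper, and the toy frame stands in for `df_train[lista_var_con]`.

```python
import numpy as np
import pandas as pd


def separar_por_porcentaje_nulos(df: pd.DataFrame, columnas: list, umbral: float = 0.30):
    """Split the columns that contain nulls into (share <= umbral, share > umbral)."""
    pct = df[columnas].isnull().mean()   # share of nulls per column
    con_nulos = pct[pct > 0]             # columns without nulls need no imputation
    bajo = con_nulos[con_nulos <= umbral].index.tolist()
    alto = con_nulos[con_nulos > umbral].index.tolist()
    return bajo, alto


# Toy data (hypothetical): x has 10% nulls, y has 50% nulls, z has none.
demo = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "y": [np.nan] * 5 + [1.0] * 5,
    "z": list(range(10)),
})
lista_media, lista_mediana = separar_por_porcentaje_nulos(demo, ["x", "y", "z"])
print(lista_media, lista_mediana)
```

Building the lists this way keeps the text and the code in agreement if the threshold changes.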
In [18]:
lista_imputar_media = []
lista_imputar_mediana = []
for variable in df_train[lista_var_con]:
    if variable in ['AMT_ANNUITY', 'AMT_GOODS_PRICE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_2', 'OBS_30_CNT_SOCIAL_CIRCLE',
                    'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE']:
        lista_imputar_media.append(variable)
    else:
        lista_imputar_mediana.append(variable)
print("Lista Imputar Media:", lista_imputar_media)
print("Lista Imputar Mediana:", lista_imputar_mediana)
Lista Imputar Media: ['AMT_ANNUITY', 'AMT_GOODS_PRICE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_2', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE'] Lista Imputar Mediana: ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'EXT_SOURCE_1', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START']
In the next cell we copy the train and test DataFrames so that the originals remain intact and are easier to manage through the later stages of the analysis.
In [19]:
copia_df_train = df_train.copy()
copia_df_test = df_test.copy()
In [20]:
# Impute with the mean
copia_df_train[lista_imputar_media] = copia_df_train[lista_imputar_media].apply(lambda x: x.fillna(x.mean()))
copia_df_test[lista_imputar_media] = copia_df_test[lista_imputar_media].apply(lambda x: x.fillna(x.mean()))
# Impute with the median
copia_df_train[lista_imputar_mediana] = copia_df_train[lista_imputar_mediana].apply(lambda x: x.fillna(x.median()))
copia_df_test[lista_imputar_mediana] = copia_df_test[lista_imputar_mediana].apply(lambda x: x.fillna(x.median()))
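The cell above fills the gaps in copia_df_test with statistics computed on the test set itself. A common alternative, shown here as an assumption and not as the notebook's method, is to compute the mean or median on train only and reuse those values for test, so that both sets are transformed identically. The helper name `imputar_con_estadisticos_de_train` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def imputar_con_estadisticos_de_train(train, test, columnas, estadistico="mean"):
    """Fill nulls in train and test using a statistic computed on train only."""
    valores = train[columnas].agg(estadistico)   # one value per column
    train_out, test_out = train.copy(), test.copy()
    train_out[columnas] = train_out[columnas].fillna(valores)
    test_out[columnas] = test_out[columnas].fillna(valores)
    return train_out, test_out


# Toy data (hypothetical): the train mean of the observed values is 200.
train = pd.DataFrame({"AMT_ANNUITY": [100.0, 200.0, np.nan, 300.0]})
test = pd.DataFrame({"AMT_ANNUITY": [np.nan, 1000.0]})
train_imp, test_imp = imputar_con_estadisticos_de_train(train, test, ["AMT_ANNUITY"])
print(test_imp)
```

With `estadistico="median"` the same helper covers the median list.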
We verify that no null values remain in the continuous variables
In [21]:
# Count null values only for the variables in lista_var_con
nulos_train_con = copia_df_train[lista_var_con].isnull().sum()
nulos_test_con = copia_df_test[lista_var_con].isnull().sum()
# Print the null counts per variable for both DataFrames
print("Valores nulos por variable (copia_df_train) :")
print(nulos_train_con)
print("\nValores nulos por variable (copia_df_test) :")
print(nulos_test_con)
Valores nulos por variable (copia_df_train) : AMT_INCOME_TOTAL 0 AMT_CREDIT 0 AMT_ANNUITY 0 AMT_GOODS_PRICE 0 REGION_POPULATION_RELATIVE 0 DAYS_REGISTRATION 0 OWN_CAR_AGE 0 CNT_FAM_MEMBERS 0 EXT_SOURCE_1 0 EXT_SOURCE_2 0 EXT_SOURCE_3 0 APARTMENTS_AVG 0 BASEMENTAREA_AVG 0 YEARS_BEGINEXPLUATATION_AVG 0 YEARS_BUILD_AVG 0 COMMONAREA_AVG 0 ELEVATORS_AVG 0 ENTRANCES_AVG 0 FLOORSMAX_AVG 0 FLOORSMIN_AVG 0 LANDAREA_AVG 0 LIVINGAPARTMENTS_AVG 0 LIVINGAREA_AVG 0 NONLIVINGAPARTMENTS_AVG 0 NONLIVINGAREA_AVG 0 APARTMENTS_MODE 0 BASEMENTAREA_MODE 0 YEARS_BEGINEXPLUATATION_MODE 0 YEARS_BUILD_MODE 0 COMMONAREA_MODE 0 ELEVATORS_MODE 0 ENTRANCES_MODE 0 FLOORSMAX_MODE 0 FLOORSMIN_MODE 0 LANDAREA_MODE 0 LIVINGAPARTMENTS_MODE 0 LIVINGAREA_MODE 0 NONLIVINGAPARTMENTS_MODE 0 NONLIVINGAREA_MODE 0 APARTMENTS_MEDI 0 BASEMENTAREA_MEDI 0 YEARS_BEGINEXPLUATATION_MEDI 0 YEARS_BUILD_MEDI 0 COMMONAREA_MEDI 0 ELEVATORS_MEDI 0 ENTRANCES_MEDI 0 FLOORSMAX_MEDI 0 FLOORSMIN_MEDI 0 LANDAREA_MEDI 0 LIVINGAPARTMENTS_MEDI 0 LIVINGAREA_MEDI 0 NONLIVINGAPARTMENTS_MEDI 0 NONLIVINGAREA_MEDI 0 TOTALAREA_MODE 0 OBS_30_CNT_SOCIAL_CIRCLE 0 DEF_30_CNT_SOCIAL_CIRCLE 0 OBS_60_CNT_SOCIAL_CIRCLE 0 DEF_60_CNT_SOCIAL_CIRCLE 0 DAYS_LAST_PHONE_CHANGE 0 AMT_REQ_CREDIT_BUREAU_HOUR 0 AMT_REQ_CREDIT_BUREAU_DAY 0 AMT_REQ_CREDIT_BUREAU_WEEK 0 AMT_REQ_CREDIT_BUREAU_MON 0 AMT_REQ_CREDIT_BUREAU_QRT 0 AMT_REQ_CREDIT_BUREAU_YEAR 0 DAYS_BIRTH 0 DAYS_EMPLOYED 0 DAYS_ID_PUBLISH 0 HOUR_APPR_PROCESS_START 0 dtype: int64 Valores nulos por variable (copia_df_test) : AMT_INCOME_TOTAL 0 AMT_CREDIT 0 AMT_ANNUITY 0 AMT_GOODS_PRICE 0 REGION_POPULATION_RELATIVE 0 DAYS_REGISTRATION 0 OWN_CAR_AGE 0 CNT_FAM_MEMBERS 0 EXT_SOURCE_1 0 EXT_SOURCE_2 0 EXT_SOURCE_3 0 APARTMENTS_AVG 0 BASEMENTAREA_AVG 0 YEARS_BEGINEXPLUATATION_AVG 0 YEARS_BUILD_AVG 0 COMMONAREA_AVG 0 ELEVATORS_AVG 0 ENTRANCES_AVG 0 FLOORSMAX_AVG 0 FLOORSMIN_AVG 0 LANDAREA_AVG 0 LIVINGAPARTMENTS_AVG 0 LIVINGAREA_AVG 0 NONLIVINGAPARTMENTS_AVG 0 NONLIVINGAREA_AVG 0 APARTMENTS_MODE 0 
BASEMENTAREA_MODE 0 YEARS_BEGINEXPLUATATION_MODE 0 YEARS_BUILD_MODE 0 COMMONAREA_MODE 0 ELEVATORS_MODE 0 ENTRANCES_MODE 0 FLOORSMAX_MODE 0 FLOORSMIN_MODE 0 LANDAREA_MODE 0 LIVINGAPARTMENTS_MODE 0 LIVINGAREA_MODE 0 NONLIVINGAPARTMENTS_MODE 0 NONLIVINGAREA_MODE 0 APARTMENTS_MEDI 0 BASEMENTAREA_MEDI 0 YEARS_BEGINEXPLUATATION_MEDI 0 YEARS_BUILD_MEDI 0 COMMONAREA_MEDI 0 ELEVATORS_MEDI 0 ENTRANCES_MEDI 0 FLOORSMAX_MEDI 0 FLOORSMIN_MEDI 0 LANDAREA_MEDI 0 LIVINGAPARTMENTS_MEDI 0 LIVINGAREA_MEDI 0 NONLIVINGAPARTMENTS_MEDI 0 NONLIVINGAREA_MEDI 0 TOTALAREA_MODE 0 OBS_30_CNT_SOCIAL_CIRCLE 0 DEF_30_CNT_SOCIAL_CIRCLE 0 OBS_60_CNT_SOCIAL_CIRCLE 0 DEF_60_CNT_SOCIAL_CIRCLE 0 DAYS_LAST_PHONE_CHANGE 0 AMT_REQ_CREDIT_BUREAU_HOUR 0 AMT_REQ_CREDIT_BUREAU_DAY 0 AMT_REQ_CREDIT_BUREAU_WEEK 0 AMT_REQ_CREDIT_BUREAU_MON 0 AMT_REQ_CREDIT_BUREAU_QRT 0 AMT_REQ_CREDIT_BUREAU_YEAR 0 DAYS_BIRTH 0 DAYS_EMPLOYED 0 DAYS_ID_PUBLISH 0 HOUR_APPR_PROCESS_START 0 dtype: int64
Null-value treatment (categorical and boolean variables)
In [22]:
lista_var_cat
Out[22]:
['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'CNT_CHILDREN', 'NWEEKDAY_PROCESS_START']
In [23]:
lista_var_bool
Out[23]:
['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
For each list of variables, we loop over its elements to find the variables with null values. Note that the boolean variables were inspected beforehand and showed no nulls, which is why the boolean loop prints a message to that effect.
In [24]:
for variable in lista_var_cat:
    valores_unicos = df_train[variable].unique()
    valores_nulos = df_train[variable].isnull().sum()
    print('Variable: ', variable)
    print('Valores únicos: ', valores_unicos)
    print('Valores nulos: ', valores_nulos)
Variable: NAME_CONTRACT_TYPE Valores únicos: ['Cash loans', 'Revolving loans'] Categories (2, object): ['Cash loans', 'Revolving loans'] Valores nulos: 0 Variable: CODE_GENDER Valores únicos: ['F', 'M', 'XNA'] Categories (3, object): ['F', 'M', 'XNA'] Valores nulos: 0 Variable: NAME_TYPE_SUITE Valores únicos: ['Unaccompanied', 'Spouse, partner', 'Family', 'Other_B', NaN, 'Children', 'Group of people', 'Other_A'] Categories (7, object): ['Children', 'Family', 'Group of people', 'Other_A', 'Other_B', 'Spouse, partner', 'Unaccompanied'] Valores nulos: 1029 Variable: NAME_INCOME_TYPE Valores únicos: ['Commercial associate', 'Working', 'Pensioner', 'State servant', 'Businessman', 'Unemployed', 'Student', 'Maternity leave'] Categories (8, object): ['Businessman', 'Commercial associate', 'Maternity leave', 'Pensioner', 'State servant', 'Student', 'Unemployed', 'Working'] Valores nulos: 0 Variable: NAME_EDUCATION_TYPE Valores únicos: ['Higher education', 'Secondary / secondary special', 'Incomplete higher', 'Lower secondary', 'Academic degree'] Categories (5, object): ['Academic degree', 'Higher education', 'Incomplete higher', 'Lower secondary', 'Secondary / secondary special'] Valores nulos: 0 Variable: NAME_FAMILY_STATUS Valores únicos: ['Married', 'Single / not married', 'Civil marriage', 'Separated', 'Widow', 'Unknown'] Categories (6, object): ['Civil marriage', 'Married', 'Separated', 'Single / not married', 'Unknown', 'Widow'] Valores nulos: 0 Variable: NAME_HOUSING_TYPE Valores únicos: ['House / apartment', 'Municipal apartment', 'With parents', 'Rented apartment', 'Co-op apartment', 'Office apartment'] Categories (6, object): ['Co-op apartment', 'House / apartment', 'Municipal apartment', 'Office apartment', 'Rented apartment', 'With parents'] Valores nulos: 0 Variable: OCCUPATION_TYPE Valores únicos: ['Laborers', 'Drivers', 'Accountants', NaN, 'Sales staff', ..., 'IT staff', 'Realty agents', 'HR staff', 'Secretaries', 'Cleaning staff'] Length: 19 Categories (18, 
object): ['Accountants', 'Cleaning staff', 'Cooking staff', 'Core staff', ..., 'Sales staff', 'Secretaries', 'Security staff', 'Waiters/barmen staff'] Valores nulos: 76940 Variable: REGION_RATING_CLIENT Valores únicos: [2, 3, 1] Categories (3, int64): [1, 2, 3] Valores nulos: 0 Variable: REGION_RATING_CLIENT_W_CITY Valores únicos: [2, 3, 1] Categories (3, int64): [1, 2, 3] Valores nulos: 0 Variable: ORGANIZATION_TYPE Valores únicos: ['Business Entity Type 2', 'Other', 'Business Entity Type 3', 'Restaurant', 'XNA', ..., 'Transport: type 1', 'Religion', 'Industry: type 13', 'Trade: type 5', 'Industry: type 8'] Length: 58 Categories (58, object): ['Advertising', 'Agriculture', 'Bank', 'Business Entity Type 1', ..., 'Transport: type 3', 'Transport: type 4', 'University', 'XNA'] Valores nulos: 0 Variable: FONDKAPREMONT_MODE Valores únicos: ['reg oper account', NaN, 'reg oper spec account', 'org spec account', 'not specified'] Categories (4, object): ['not specified', 'org spec account', 'reg oper account', 'reg oper spec account'] Valores nulos: 168215 Variable: HOUSETYPE_MODE Valores únicos: ['block of flats', NaN, 'specific housing', 'terraced house'] Categories (3, object): ['block of flats', 'specific housing', 'terraced house'] Valores nulos: 123328 Variable: WALLSMATERIAL_MODE Valores únicos: ['Panel', NaN, 'Block', 'Stone, brick', 'Mixed', 'Others', 'Wooden', 'Monolithic'] Categories (7, object): ['Block', 'Mixed', 'Monolithic', 'Others', 'Panel', 'Stone, brick', 'Wooden'] Valores nulos: 124975 Variable: CNT_CHILDREN Valores únicos: [2, 0, 1, 3, 4, ..., 7, 11, 12, 9, 19] Length: 15 Categories (15, int64): [0, 1, 2, 3, ..., 11, 12, 14, 19] Valores nulos: 0 Variable: NWEEKDAY_PROCESS_START Valores únicos: [3, 2, 1, 4, 5, 6, 7] Categories (7, int64): [1, 2, 3, 4, 5, 6, 7] Valores nulos: 0
In [25]:
col_cat = df_train.select_dtypes(include=['category']).columns.tolist()
for col in col_cat:
    valores_nulos = df_train[col].isnull().sum()
    tipo_variable = df_train[col].dtype
    valores_unicos = df_train[col].unique()
    if valores_nulos > 0:
        print(f"Variable: {col}")
        print(f" - Valores faltantes: {valores_nulos}")
        print(f" - Tipo de variable: {tipo_variable}")
        print(f" - Valores únicos: {valores_unicos}")
        print("-" * 90)
Variable: NAME_TYPE_SUITE - Valores faltantes: 1029 - Tipo de variable: category - Valores únicos: ['Unaccompanied', 'Spouse, partner', 'Family', 'Other_B', NaN, 'Children', 'Group of people', 'Other_A'] Categories (7, object): ['Children', 'Family', 'Group of people', 'Other_A', 'Other_B', 'Spouse, partner', 'Unaccompanied'] ------------------------------------------------------------------------------------------ Variable: OCCUPATION_TYPE - Valores faltantes: 76940 - Tipo de variable: category - Valores únicos: ['Laborers', 'Drivers', 'Accountants', NaN, 'Sales staff', ..., 'IT staff', 'Realty agents', 'HR staff', 'Secretaries', 'Cleaning staff'] Length: 19 Categories (18, object): ['Accountants', 'Cleaning staff', 'Cooking staff', 'Core staff', ..., 'Sales staff', 'Secretaries', 'Security staff', 'Waiters/barmen staff'] ------------------------------------------------------------------------------------------ Variable: FONDKAPREMONT_MODE - Valores faltantes: 168215 - Tipo de variable: category - Valores únicos: ['reg oper account', NaN, 'reg oper spec account', 'org spec account', 'not specified'] Categories (4, object): ['not specified', 'org spec account', 'reg oper account', 'reg oper spec account'] ------------------------------------------------------------------------------------------ Variable: HOUSETYPE_MODE - Valores faltantes: 123328 - Tipo de variable: category - Valores únicos: ['block of flats', NaN, 'specific housing', 'terraced house'] Categories (3, object): ['block of flats', 'specific housing', 'terraced house'] ------------------------------------------------------------------------------------------ Variable: WALLSMATERIAL_MODE - Valores faltantes: 124975 - Tipo de variable: category - Valores únicos: ['Panel', NaN, 'Block', 'Stone, brick', 'Mixed', 'Others', 'Wooden', 'Monolithic'] Categories (7, object): ['Block', 'Mixed', 'Monolithic', 'Others', 'Panel', 'Stone, brick', 'Wooden'] 
------------------------------------------------------------------------------------------
In [26]:
col_bool = df_train.select_dtypes(include=[bool]).columns.tolist()
# Note: select_dtypes(include=[bool]) only returns columns stored with bool dtype.
# The output below shows no per-column message, which suggests this list was empty;
# iterating over lista_var_bool instead may be a more reliable check.
# Track whether any null values are found
hay_valores_nulos = False
for col in col_bool:
    valores_nulos = df_train[col].isnull().sum()
    tipo_variable = df_train[col].dtype
    valores_unicos = df_train[col].unique()
    if valores_nulos > 0:
        hay_valores_nulos = True
        print(f"Variable: {col}")
        print(f" - Valores faltantes: {valores_nulos}")
        print(f" - Tipo de variable: {tipo_variable}")
        print(f" - Valores únicos: {valores_unicos}")
        print("-" * 90)
    else:
        print(f"Variable: {col} - No tiene valores nulos")
# If no null values were found
if not hay_valores_nulos:
    print("Ninguna variable tiene valores nulos.")
Ninguna variable tiene valores nulos.
Cramér's V
Cramér's V measures the strength of association between two categorical variables, that is, how closely related they are. It ranges from 0 to 1, and the closer it is to 1, the stronger the association. Although it measures the relationship between variables, it says nothing about causality: it does not tell us that one variable causes the other.
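For reference, the uncorrected statistic is V = sqrt(chi2 / (n * (min(r, c) - 1))), where chi2 is Pearson's chi-squared of the contingency table, n the number of observations, and r and c the number of rows and columns. The sketch below implements that plain formula with NumPy. It is not the code of `f_aux.cramers_v`, whose internals are not shown here and which may apply a bias correction (the 0.99997 value reported for TARGET against itself hints at one). The name `cramers_v_simple` is hypothetical, and the function assumes no row or column of the table sums to zero.

```python
import numpy as np


def cramers_v_simple(tabla) -> float:
    """Uncorrected Cramér's V for an r x c contingency table of counts."""
    obs = np.asarray(tabla, dtype=float)
    n = obs.sum()
    # Expected counts under independence: (row total * column total) / n.
    esperado = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    chi2 = ((obs - esperado) ** 2 / esperado).sum()
    k = min(obs.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Toy tables (hypothetical counts): perfect association and no association.
v_perfecta = cramers_v_simple([[10, 0], [0, 10]])
v_nula = cramers_v_simple([[10, 10], [10, 10]])
print(v_perfecta, v_nula)
```

It accepts the same kind of input as the loops that follow, for example `pd.crosstab(df_train['TARGET'], df_train[variable]).values`.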
In [27]:
for variable in lista_var_cat:
    print('-'*90)
    # f-string so that the variable name is interpolated into the header
    print(f'Matriz de confusión {variable} con respecto a TARGET:')
    confusion_matriz = pd.crosstab(df_train['TARGET'], df_train[variable])
    print(confusion_matriz)
    valor_cramer = f_aux.cramers_v(confusion_matrix = confusion_matriz.values)
    print('Valor de Cramers:', valor_cramer )
------------------------------------------------------------------------------------------
Matriz de confusión NAME_CONTRACT_TYPE con respecto a TARGET:
NAME_CONTRACT_TYPE Cash loans Revolving loans
TARGET
0 203988 22160
1 18572 1288
Valor de Cramers: 0.030647843080174268
------------------------------------------------------------------------------------------
Matriz de confusión CODE_GENDER con respecto a TARGET:
CODE_GENDER F M XNA
TARGET
0 150553 75593 2
1 11334 8526 0
Valor de Cramers: 0.05451190495295015
------------------------------------------------------------------------------------------
Matriz de confusión NAME_TYPE_SUITE con respecto a TARGET:
NAME_TYPE_SUITE Children Family Group of people Other_A Other_B Spouse, partner Unaccompanied
TARGET
0 2479 29690 202 636 1293 8354 182522
1 194 2390 19 61 139 714 16286
Valor de Cramers: 0.00969475405832943
------------------------------------------------------------------------------------------
Matriz de confusión NAME_INCOME_TYPE con respecto a TARGET:
NAME_INCOME_TYPE Businessman Commercial associate Maternity leave Pensioner State servant Student Unemployed Working
TARGET
0 9 52925 3 41769 16494 11 12 114925
1 0 4348 0 2365 1024 0 5 12118
Valor de Cramers: 0.06202037950610154
------------------------------------------------------------------------------------------
Matriz de confusión NAME_EDUCATION_TYPE con respecto a TARGET:
NAME_EDUCATION_TYPE Academic degree Higher education Incomplete higher Lower secondary Secondary / secondary special
TARGET
0 122 56576 7574 2690 159186
1 2 3226 715 335 15582
Valor de Cramers: 0.056615705136302014
------------------------------------------------------------------------------------------
Matriz de confusión NAME_FAMILY_STATUS con respecto a TARGET:
NAME_FAMILY_STATUS Civil marriage Married Separated Single / not married Unknown Widow
TARGET
0 21427 145313 14575 32743 2 12088
1 2385 11840 1265 3616 0 754
Valor de Cramers: 0.04198246008049476
------------------------------------------------------------------------------------------
Matriz de confusión NAME_HOUSING_TYPE con respecto a TARGET:
NAME_HOUSING_TYPE Co-op apartment House / apartment Municipal apartment Office apartment Rented apartment With parents
TARGET
0 804 201363 8158 1961 3395 10467
1 78 17020 758 136 478 1390
Valor de Cramers: 0.03695484251026577
------------------------------------------------------------------------------------------
Matriz de confusión OCCUPATION_TYPE con respecto a TARGET:
OCCUPATION_TYPE Accountants Cleaning staff Cooking staff Core staff Drivers HR staff High skill tech staff IT staff Laborers Low-skill Laborers Managers Medicine staff Private service staff Realty agents Sales staff Secretaries Security staff Waiters/barmen staff
TARGET
0 7517 3384 4266 20683 13261 412 8522 398 39602 1389 16032 6391 1956 546 23148 972 4798 926
1 365 357 501 1410 1683 31 571 26 4657 307 1064 480 138 48 2458 75 574 120
Valor de Cramers: 0.08109692091074842
------------------------------------------------------------------------------------------
Matriz de confusión REGION_RATING_CLIENT con respecto a TARGET:
REGION_RATING_CLIENT 1 2 3
TARGET
0 24506 167318 34324
1 1249 14313 4298
Valor de Cramers: 0.05890284619889794
------------------------------------------------------------------------------------------
Matriz de confusión REGION_RATING_CLIENT_W_CITY con respecto a TARGET:
REGION_RATING_CLIENT_W_CITY 1 2 3
TARGET
0 25993 169111 31044
1 1330 14511 4019
Valor de Cramers: 0.06135786316089503
------------------------------------------------------------------------------------------
Matriz de confusión ORGANIZATION_TYPE con respecto a TARGET:
ORGANIZATION_TYPE Advertising Agriculture Bank Business Entity Type 1 Business Entity Type 2 Business Entity Type 3 Cleaning Construction Culture Electricity Emergency Government Hotel Housing Industry: type 1 Industry: type 10 Industry: type 11 Industry: type 12 Industry: type 13 Industry: type 2 Industry: type 3 Industry: type 4 Industry: type 5 Industry: type 6 Industry: type 7 Industry: type 8 Industry: type 9 Insurance Kindergarten Legal Services Medicine Military Mobile Other Police Postal Realtor Religion Restaurant School Security Security Ministries Self-employed Services Telecom Trade: type 1 Trade: type 2 Trade: type 3 Trade: type 4 Trade: type 5 Trade: type 6 Trade: type 7 Transport: type 1 Transport: type 2 Transport: type 3 Transport: type 4 University XNA
TARGET
0 302 1762 1890 4432 7703 49388 188 4754 278 701 426 7768 733 2184 751 79 1988 281 44 344 2364 649 435 82 964 16 2539 449 5156 229 8379 2009 229 12310 1769 1598 287 65 1274 6760 2333 1517 27450 1175 415 247 1419 2514 50 39 472 5673 152 1630 812 3919 999 41773
1 29 213 110 388 728 5048 24 645 18 56 32 573 50 187 95 6 181 12 7 31 286 67 31 5 81 3 181 27 396 21 606 104 21 997 92 140 37 4 168 419 274 81 3122 80 37 27 104 286 2 1 20 591 8 139 152 392 55 2370
Valor de Cramers: 0.07181461273694796
------------------------------------------------------------------------------------------
Matriz de confusión FONDKAPREMONT_MODE con respecto a TARGET:
FONDKAPREMONT_MODE not specified org spec account reg oper account reg oper spec account
TARGET
0 4228 4218 54913 9038
1 346 277 4133 640
Valor de Cramers: 0.008709666149370406
------------------------------------------------------------------------------------------
Matriz de confusión HOUSETYPE_MODE con respecto a TARGET:
HOUSETYPE_MODE block of flats specific housing terraced house
TARGET
0 112086 1091 912
1 8392 118 81
Valor de Cramers: 0.010835039064936313
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
WALLSMATERIAL_MODE Block Mixed Monolithic Others Panel Stone, brick Wooden
TARGET
0 6947 1667 1376 1203 49487 48023 3864
1 526 132 69 105 3367 3839 428
Valor de Cramers: 0.030261061414271043
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
CNT_CHILDREN 0 1 2 3 4 5 6 7 8 9 10 11 12 14 19
TARGET
0 158996 44679 19444 2657 282 63 13 7 2 0 2 0 1 1 1
1 13236 4427 1860 279 43 6 6 0 0 2 0 1 0 0 0
Valor de Cramers: 0.025614997270705864
------------------------------------------------------------------------------------------
Matriz de confusión {variiable} con respecto a TARGET:
NWEEKDAY_PROCESS_START 1 2 3 4 5 6 7
TARGET
0 37449 39585 38081 37219 36939 24975 11900
1 3141 3582 3403 3242 3302 2129 1061
Valor de Cramers: 0.0053830022738895695
In [28]:
for variable in lista_var_bool:
    print('-'*90)
    # f-string so that the variable name is interpolated into the header
    print(f'Matriz de confusión {variable} con respecto a TARGET:')
    # Contingency table: TARGET in rows, the current variable in columns
    confusion_matriz = pd.crosstab(df_train['TARGET'], df_train[variable])
    print(confusion_matriz)
    valor_cramer = f_aux.cramers_v(confusion_matrix=confusion_matriz.values)
    print('Valor de Cramers:', valor_cramer)
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
TARGET 0 1
TARGET
0 226148 0
1 0 19860
Valor de Cramers: 0.9999726127135284
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_OWN_CAR N Y
TARGET
0 148634 77514
1 13779 6081
Valor de Cramers: 0.020917624000671178
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_OWN_REALTY N Y
TARGET
0 69058 157090
1 6260 13600
Valor de Cramers: 0.005438185035782544
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_MOBIL 0 1
TARGET
0 1 226147
1 0 19860
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_EMP_PHONE 0 1
TARGET
0 41783 184365
1 2371 17489
Valor de Cramers: 0.04634411452463542
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_WORK_PHONE 0 1
TARGET
0 181749 44399
1 15136 4724
Valor de Cramers: 0.02821564665685336
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_CONT_MOBILE 0 1
TARGET
0 420 225728
1 42 19818
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_PHONE 0 1
TARGET
0 161896 64252
1 14998 4862
Valor de Cramers: 0.023718445281449046
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_EMAIL 0 1
TARGET
0 213309 12839
1 18742 1118
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
REG_REGION_NOT_LIVE_REGION 0 1
TARGET
0 222790 3358
1 19523 337
Valor de Cramers: 0.004231264273072046
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
REG_REGION_NOT_WORK_REGION 0 1
TARGET
0 214679 11469
1 18745 1115
Valor de Cramers: 0.0063669493983877
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
LIVE_REGION_NOT_WORK_REGION 0 1
TARGET
0 216872 9276
1 19007 853
Valor de Cramers: 0.0016623195533143134
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
REG_CITY_NOT_LIVE_CITY 0 1
TARGET
0 209352 16796
1 17493 2367
Valor de Cramers: 0.045581251151933795
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
REG_CITY_NOT_WORK_CITY 0 1
TARGET
0 175357 50791
1 13812 6048
Valor de Cramers: 0.051608603270209136
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
LIVE_CITY_NOT_WORK_CITY 0 1
TARGET
0 186219 39929
1 15419 4441
Valor de Cramers: 0.03325849436451175
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
EMERGENCYSTATE_MODE 0 1
TARGET
0 224466 1682
1 19685 175
Valor de Cramers: 0.0037283320336315645
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_2 0 1
TARGET
0 226140 8
1 19857 3
Valor de Cramers: 0.002979057679884047
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_3 0 1
TARGET
0 66828 159320
1 4404 15456
Valor de Cramers: 0.04423623442859539
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_4 0 1
TARGET
0 226126 22
1 19860 0
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_5 0 1
TARGET
0 222727 3421
1 19563 297
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_6 0 1
TARGET
0 205783 20365
1 18646 1214
Valor de Cramers: 0.027754414576280584
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_7 0 1
TARGET
0 226105 43
1 19857 3
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_8 0 1
TARGET
0 207584 18564
1 18401 1459
Valor de Cramers: 0.008323539864287808
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_9 0 1
TARGET
0 225258 890
1 19803 57
Valor de Cramers: 0.004097171764277719
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_10 0 1
TARGET
0 226144 4
1 19860 0
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_11 0 1
TARGET
0 225257 891
1 19800 60
Valor de Cramers: 0.0033536791216609795
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_12 0 1
TARGET
0 226147 1
1 19860 0
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_13 0 1
TARGET
0 225339 809
1 19837 23
Valor de Cramers: 0.011040517273742204
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_14 0 1
TARGET
0 225449 699
1 19833 27
Valor de Cramers: 0.008316733249016585
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_15 0 1
TARGET
0 225861 287
1 19852 8
Valor de Cramers: 0.006287931203058809
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_16 0 1
TARGET
0 223807 2341
1 19736 124
Valor de Cramers: 0.010977504954893965
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_17 0 1
TARGET
0 226088 60
1 19859 1
Valor de Cramers: 0.0025432021717690235
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_18 0 1
TARGET
0 224264 1884
1 19742 118
Valor de Cramers: 0.006871884485363943
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_19 0 1
TARGET
0 226008 140
1 19850 10
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_20 0 1
TARGET
0 226036 112
1 19848 12
Valor de Cramers: 0.0
------------------------------------------------------------------------------------------
Matriz de confusión {variable} con respecto a TARGET:
FLAG_DOCUMENT_21 0 1
TARGET
0 226081 67
1 19848 12
Valor de Cramers: 0.0037594804049241475
Conclusions on Cramér's V
Across all the results, Cramér's V is low, which indicates weak associations with TARGET. For FLAG_MOBIL, FLAG_CONT_MOBILE, FLAG_EMAIL, FLAG_DOCUMENT_4, FLAG_DOCUMENT_7, FLAG_DOCUMENT_10, FLAG_DOCUMENT_12, FLAG_DOCUMENT_19, FLAG_DOCUMENT_20 and several other variables the value is at or close to 0, so we can provisionally treat these variables as irrelevant for modeling.
Variables with small values that are not close to zero should not be dismissed, because combined with other features they could have a cumulative impact. Some of the variables in this range are FLAG_OWN_CAR, FLAG_PHONE, FLAG_DOCUMENT_3, REG_CITY_NOT_WORK_CITY, NAME_HOUSING_TYPE and REG_CITY_NOT_LIVE_CITY.
Another group of variables shows somewhat higher values than the above: CODE_GENDER (0.0545), which suggests that gender has a weak association with the target variable; NAME_EDUCATION_TYPE (0.0566), the applicant's education level; OCCUPATION_TYPE (0.0811), the applicant's occupation; and ORGANIZATION_TYPE (0.0718), the type of organization where the applicant works. All of them remain below 0.1, so the associations are still weak, but it is reasonable that these variables contribute more than the previous ones, since they describe aspects of the applicant that can be placed on a scale.
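To make the values above easier to interpret, here is a minimal sketch of a bias-corrected Cramér's V. The implementation of `f_aux.cramers_v` is not shown in this notebook, so the function below (`cramers_v_sketch`, a hypothetical name) is an assumption about how such a helper could look, not the author's code. The two tables are copied from the printed output above.

```python
import numpy as np
import scipy.stats as ss


def cramers_v_sketch(confusion_matrix):
    """Bias-corrected Cramér's V for a contingency table (hypothetical helper)."""
    table = np.asarray(confusion_matrix, dtype=float)
    # chi2_contingency applies Yates' continuity correction to 2x2 tables by default
    chi2 = ss.chi2_contingency(table)[0]
    n = table.sum()
    r, k = table.shape
    phi2 = chi2 / n
    # Bias correction: subtract the expected phi2 under independence, clip at 0
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2_corr / min(k_corr - 1, r_corr - 1)))


# Counts taken from the output above (rows: TARGET 0 / 1)
table_target = [[226148, 0], [0, 19860]]          # TARGET against itself
table_own_car = [[148634, 77514], [13779, 6081]]  # FLAG_OWN_CAR: N / Y

v_target = cramers_v_sketch(table_target)
v_own_car = cramers_v_sketch(table_own_car)
print("TARGET vs TARGET:", v_target)
print("FLAG_OWN_CAR vs TARGET:", v_own_car)
```

Under these assumptions, a variable crossed with itself gives a value just under 1 rather than exactly 1 because of the continuity correction, and the clipping at 0 would explain why very sparse flags can print exactly 0.0.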
Impute null values (categorical variables)
In [29]:
copia_df_train[lista_var_cat] = copia_df_train[lista_var_cat].astype("object").fillna("SIN VALOR").astype("category")
copia_df_test[lista_var_cat] = copia_df_test[lista_var_cat].astype("object").fillna("SIN VALOR").astype("category")
The following code completes the imputation of null values: entries labeled 'XNA' or 'NaN' are replaced with 'SIN VALOR', because missing values appear in the data under several different names.
In [30]:
# Train: replace the textual placeholders for missing values
categorical_columns = copia_df_train.select_dtypes(include=['object', 'category']).columns
copia_df_train[categorical_columns] = copia_df_train[categorical_columns].replace('XNA', 'SIN VALOR')
copia_df_train[categorical_columns] = copia_df_train[categorical_columns].replace('NaN', 'SIN VALOR')
# Test: same replacements, with the columns selected from the test set
categorical_columns = copia_df_test.select_dtypes(include=['object', 'category']).columns
copia_df_test[categorical_columns] = copia_df_test[categorical_columns].replace('XNA', 'SIN VALOR')
copia_df_test[categorical_columns] = copia_df_test[categorical_columns].replace('NaN', 'SIN VALOR')
# Check the change in the first rows
print(copia_df_test[categorical_columns].head())
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | OCCUPATION_TYPE | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | ORGANIZATION_TYPE | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | NWEEKDAY_PROCESS_START |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 396899 | Cash loans | M | Y | Y | 1 | Family | Working | Higher education | Married | House / apartment | Laborers | 2 | 2 | Transport: type 4 | SIN VALOR | SIN VALOR | SIN VALOR | 1 |
| 322041 | Cash loans | F | N | N | 0 | Family | Working | Secondary / secondary special | Married | House / apartment | Laborers | 2 | 2 | Government | SIN VALOR | SIN VALOR | SIN VALOR | 4 |
| 220127 | Cash loans | M | N | Y | 0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | Security staff | 2 | 2 | Industry: type 4 | SIN VALOR | block of flats | Panel | 4 |
| 251531 | Cash loans | F | N | N | 0 | Unaccompanied | Working | Higher education | Single / not married | House / apartment | Core staff | 3 | 3 | Self-employed | SIN VALOR | SIN VALOR | SIN VALOR | 4 |
| 345558 | Cash loans | F | N | Y | 0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | Laborers | 2 | 2 | Government | SIN VALOR | SIN VALOR | SIN VALOR | 6 |
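The two imputation cells above can also be read as a single step: treat textual placeholders and real NaN values alike, then assign one common label. The sketch below illustrates that idea on a toy frame. The helper name `unify_missing`, the toy data and the `PLACEHOLDERS` list are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data mixing real NaN with the textual placeholders 'XNA' and 'NaN'
toy = pd.DataFrame({
    "ORGANIZATION_TYPE": pd.Categorical(["Bank", "XNA", np.nan]),
    "HOUSETYPE_MODE": pd.Categorical([np.nan, "block of flats", "NaN"]),
})

PLACEHOLDERS = ["XNA", "NaN"]  # assumed spellings of a missing value


def unify_missing(df, columns, label="SIN VALOR"):
    """Return a copy where NaN and placeholder strings share one label."""
    out = df.copy()
    for col in columns:
        # Work on object dtype: a categorical column cannot be filled
        # with a label that is not already one of its categories
        values = out[col].astype("object")
        # Turn the textual placeholders into real NaN first
        values = values.where(~values.isin(PLACEHOLDERS))
        out[col] = values.fillna(label).astype("category")
    return out


toy_clean = unify_missing(toy, ["ORGANIZATION_TYPE", "HOUSETYPE_MODE"])
print(toy_clean)
```

Converting to `object` before filling mirrors the `astype("object").fillna(...).astype("category")` chain used in the notebook; the only difference in this sketch is that placeholders and NaN are handled in one pass.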
Save CSV
In [31]:
copia_df_train.to_csv('... /data/train_df_preprocessing_missing_outlier.csv')
copia_df_test.to_csv('... /data/test_df_preprocessing_missing_outlier.csv')
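One point to keep in mind when these files are read back: CSV does not store dtypes, so the `category` columns come back as plain strings unless the dtype is requested again. The following sketch shows a round trip on a toy frame; it uses an in-memory buffer instead of the real files, and the toy data is an assumption made for illustration.

```python
import io

import pandas as pd

# Toy frame with the same index name and one categorical column
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2],
    "HOUSETYPE_MODE": pd.Categorical(["SIN VALOR", "block of flats"]),
}).set_index("SK_ID_CURR")

# Write to an in-memory buffer (stands in for the CSV file on disk)
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Restore the index and request the category dtype explicitly
reloaded = pd.read_csv(
    buffer,
    index_col="SK_ID_CURR",
    dtype={"HOUSETYPE_MODE": "category"},
)
print(reloaded.dtypes)
```

The label 'SIN VALOR' is not among the strings that `read_csv` interprets as missing by default, so it survives the round trip as an ordinary category.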